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Abstract 

We give fast algorithms for ip regression and related problems: for an n x d input matrix A 
and vector b G M", in 0{nd\ogn) time we reduce the problem min^-gRrf \\Ax — b\\p to the same 
problem with input matrix A of dimension S x d and corresponding b of dimension S x 1; A 
and b are a coreset for the problem, consisting of sampled and rescaled rows of A and b. Here S 
is independent of n, and polynomial in d. Our results improve on the best previous algorithms 
when n ^ d, for all p G [1, oo) except p = 2, in particular the 0{nd^'^'^^~^) running time of Sohlcr 
and Woodruff (STOC, 2011) for p = 1, that uses asymptotically fast matrix multiplication, and 
the 0{nd^\ogn) time of Dasgupta et al. (SODA, 2008) for general p. We also give a detailed 
empirical evaluation of implementations of our algorithms for p = I, comparing them with 
several related algorithms. Among other things, our results clearly show that the practice 
follows the theory closely, in the asymptotic regime. In addition, we show near-optimal results 
for £i regression problems that arc too large for any prior solution methods. Our algorithms use 
our faster constructions of well-conditioned bases for £p spaces, and for p = 1, a fast subspace 
embedding: a matrix 11 : M" i-^ ^0{d\ogd)^ found obliviously to A, that approximately preserves 
the £i norms of all vectors in {Ax \ x € M''}; that is, « ||nAa;||j^, for all x, with 

distortion O(d^). Moreover, HA can be computed in 0{ndlogd) time. Our techniques include 
fast Johnson-Lindenstrauss transforms, low coherence matrices, and rescaling by Cauchy random 
variables. 



*IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120 USA. Email: klclarks@us.ibm.com 
^Dcpt. of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180. Email: drinep@cs.rpi.edu 
*Dept. of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180. Email: magdon@cs.rpi.edu 
^Dept. of Mathematics, Stanford University, Stanford, CA 94305. Email: mmahoney@cs.stanford.edu 

^ICME, Stanford University, Stanford, CA 94305. Email: mengxr@stanford.edu 

"iBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120 USA. Email: dpwoodru@us.ibm.com 



1 



1 Introduction 



Gaussian random projections provide low distortion subspace embeddings in the ^2-iiorm, mapping 
an arbitrary d- dimensional subspace in M" into a d-dimensional subpace in with r = 0{d), and 
distorting the ^2-iiorm of each vector in the subspace by at most a constant factor. The embedding 
is oblivious in the sense that it is implemented by a linear mapping chosen from a distribution 
on mappings that is independent of the input subspace. Such low distortion embeddings can be 
used to speed up various geometric algorithms, if they can be computed fast enough. The Fast 
John-Lindenstrauss transform (FJLT) is such an embedding, computable in 0{nlogd) time, using 
the fast Hadamard transform [IJ. This leads to fast algorithms for constructing orthonormal bases, 
^2-i'egression, and ^2-subspace approximation, which in turn lead to faster algorithms for a range 
of related problems plfTHlfTO]. 

Analogous to the Gaussian projection for £2, the Cauchy Transform was introduced in |21j to 
provide low distortion embeddings for the £i-norm. The Cauchy Transform consists of a dense 
matrix of Cauchy random variables, and so it is "slow" to apply to an arbitrary matrix A, but 
it provides the first analog of the Johnson-Lindenstrauss embedding for the ii norm, and it can 
be used to speed up randomized algorithms for problems such as £1 regression and ii subspace 
approximation [21]. 

We present a Fast Cauchy Transform (FCT), which is an ii analog of the FJLT. As claimed 
by Theorem it gives an 0(d^ log^ d)-distortion embedding that can be computed in 0{ndlogd) 
time. To achieve this result we use a combination of a low coherence "spreading" operator together 
with scaling by Cauchy random variables. 

Our main application of our FCT embedding is to constructing the current fastest algorithm 
for computing a well- conditioned basis for ii (Theorem [2]); such a basis is an analog for the ii 
norm of an orthonormal basis for the ^2-iiorm. This improves the result in jSlJ. We also give a 
generalization to ip in Section [5] 

The main application for well-conditioned bases is to regression: if the rows of A are sampled 
according to probabilities derived from the norms of the rows of such a basis, the resulting sam- 
ple rows (and corresponding entries of b) are with high probability a coreset for the regression 
problem: a small subset of the input that determines a solution that is also a good approximate 
solution for the original problem. Based on our constructions of well-conditioned bases, we give 
the fastest known construction of coresets for ii regression, of size ^ poly((i, log -), and achiev- 
ing a (1 + e)-approximation guarantee (Theorem [3]) . Our construction runs in 0{ndlogn) time, 
improving the previous best algorithm of [21] which required 0{nd'^~^) time. Our extension to 
finding an £p well-conditioned basis algorithm also leads to an 0{ndlogn) time algorithm for ip- 
regression, improving the 0{nd^ logn) algorithm in [?]• We further optimize the running time for 
p = 1 to 0(n(ilog(e^^(ilogn)) (Appendix [e|) , and we generalize this result to multiple regression 
(Theorem [4]) . One application of multiple regression is to subspace approximation, and we give the 
current fastest algorithm providing a (l+e)-approximation guarantee for £1 subspace approximation 
(Theorem [s]) . 

Finally, we also provide the first empirical evaluation for this class of randomized algorithms. 
In particular, we provide a detailed evaluation of a numerical implementation of both of our FCT 
constructions, and we compare the results with an implementation of the (slow) Cauchy Transform, 
as well as a Gaussian Transform and an FJLT; these latter two are ^2-based projections. We evaluate 
the quality of the £i-well-conditioned basis, the core component in all our geometric algorithms, 
on a suite of matrices designed to test the limits of these randomized algorithms, and we also 
evaluate how the method performs in the context of £1 regression. This latter evaluation includes 
an implementation on a terabyte-scale problem, where we achieve 10~^ relative error approximation 
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to optimal, a task that was infeasible prior to our work. 



2 Preliminaries and Cauchy Tail Inequalities 

Let A G M"^'^ he an n X d input matrix, where we assume n ^ d and A has full column rank. The 
task of regression is to find the vector x* € M'^ that minimizes \\Ax — b\\ with respect to x, for a 
given b € M" and norm || • ||. Here, we focus mostly on the £i norm, though discuss extensions to 
ip for any p > 1. Recall that, for p G [l,oo], the £p norm of a vector x is \\x\\p = (Y^ - \xi\Py^^, 
defined to be maxj \xi\ for p = oo. Let [n] denote the set {1,2, . . . ,n}; and let A^^^ and A^^^ be 
the ith row vector and jth column vector of A, respectively. For matrices, we use the Frobenius 
norm \\A\\'^p = Y17=iYl'j=i ^ij^ ^he ^2-operator (or spectral) norm \\A\\2 = sup ||x||2=i ll^2;||2, and 
the entrywise ip norm \\X\\p = (Yli j (the exception is p = 2, where this notation is used 

for the spectral norm and the entrywise 2-norm is the Frobenius norm). 

The standard inner product between vectors x, y is {x, y) = x'^y; Cj are standard basis vectors 
of the relevant dimension; and In denotes the n x n identity matrix. Lastly, c refers to a generic 
constant whose specific value may vary throughout the paper. 



Sums of Cauchy Random Variables. The Cauchy distribution is the unique 1-stable distri- 
bution having density p{x) = ^j^^- If Ci, . . . ,Cm are independent Cauchys, then Yi^[J^,■^ ^iCi 
is distributed as a Cauchy scaled by 7 = Ylii<^[M] l7«l- The Cauchy distribution will factor heav- 
ily in our discussion, and bounds for sums of Cauchy random variables will be used throughout. 
The following upper and lower tail inequalities for sums of Cauchy random variables are proved in 
Appendix [Aj 

Lemma 1 (Cauchy Upper Tail Inequality). For i G [m], let Ci he m (not necessarily independent) 
Cauchy random variables, and 7i > with 7 = Yl,ie[m] ~ Ylie[m] 7«|C'j|- Then, 

„ r n 2 /log(l + (mt)2) \ 41og(mt) , , 

Remark. The bound has only logarithmic dependence on the number of Cauchy random variables 
and does not rely on any independence assumption among the random variables. Even if the 
Cauchys are independent, one cannot substantially improve on this bound due to the nature of 
the Cauchy distribution. This is because, for independent Cauchys, X^j7i|Ci| > IX]j7i^«l 
the latter sum is itself distributed as a Cauchy scaled by 7. Hence for independent Cauchys, 
Pr[X > 7t] > f tan-i t = n{]). 

Lemma 2 (Cauchy Lower Tail Inequality). For i £ [r], let d he independent Cauchy random 
variables, and 7^ > with 7 = XlieH '^^ Sie[r] - 7^//3^- Let X = Yli&[r] Then, for 

t > 0, 

Pr {X < 7(1 - t)] < exp ' ^ 
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We will also need an "£i-sampling lemma," which is an application of Bernstein's inequality. 

Lemma 3. Let Z G M"^'^ and suppose that for i G [n], a||Z(j)||_^ < Aj < For s > 0, define 

Pi = min{l, s ■ Aj/ Y2ie[n] ^i}' ^'^^ ^ M"^" be a random diagonal matrix with Da = 1/pi with 

probability Pi, and otherwise. Then, for any (fixed) x G M*^, with probability at least 1 — 5, 



;i - e)||Zx||i < \\DZx\\-^ < (1 + (^)\\Zx\ 



1' 
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3 Our Main Result: the Fast Cauchy Transform 



Here, we present the Fast Cauchy Transform (FCT). We start, in the next subsection, by describing 
the FCT and stating our main result (Theorem [T]) , which provides running time and quahty-of- 
approximation guarantees for the embedding. In the next subsection, we provide the proof. 



3.1 Construction and properties of the FCT 

Let 5 £ (0, 1] be a parameter governing the probabiUty of failure of our algorithm, and 77 > is a 
generic arbitrarily small positive constant (whose value may change from one formula to another). 
Our construction first preprocesses by a low-coherence "spreading matrix,", rescales by Cauchy 
random variables (hence the name Fast Cauchy Transform) and finally samples linear combinations 
of the rows. Later we will give a second approach based on the £2-FJLT [l\ that is slightly weaker 
but generalizes to £p for p > 1 (see Section Isl) . 



FCT via a low-coherence matrix We construct Hi as 

Hi = ABCH, 

where 

B £ ]^i'ix2n j^g^g each column chosen independently and uniformly from the ri standard basis 
vectors for W"^; for a sufficiently large, we will set the parameter ri — -'^i'-"-'^ 
controls the probability that our algorithms fail and a is a suitably large constant; 
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C G ig2nx2n jg ^ diagonal matrix with diagonal entries chosen independently from a Cauchy 
distribution; and 

H £ ]g2nxn jg ^ block-diagonal matrix comprised of n/s blocks along the diagonal. Each block is 
the 2s X s matrix Gg = [ ] ; where Is is the s x s identity matrix, and Hs is the normalized 
Hadamard matrix. We will set s = (96ri)^. (Here, for simplicity, we assume s is a power of 
two and n/s is an integer.) 

'Gs 

Gs 



H 



Gs 



Heuristically, the effect of H is to spread the weight of a vector, so that Hy has many entries that 
are not too small (as discussed in Lemma [4] below). This means that the vector GHy comprises 
Cauchy random variables with scale factors that are not too small, and finally these variables are 
summed up by B, yielding a vector BCHy whose ii norm won't be too small relative to ||y||i. For 



this version of the FCT, we can prove the following result (see Appendix 3.2 for the proof). 



Theorem 1 (Fast Cauchy Transform (FCT I)). There is a distribution over matrices Hi G M**!^" 
with ri = 0{dlogd + dlog |), such that for an arbitrary (but fixed) A G W^^'^, and for all x G M*^, 
the inequalities 



Ax\\^ < \\UAx\ 



1 — ^ 1 1 -^"^ 1 1 \ 



hold with probability 1 — S, where k = log(rid)). Further, for any y G M", the product Hy can 

be computed in 0(n log ri) time. 

Setting 5 to a small constant, k = 0{d'^ log^ d) in the above theorem. 

Remark. The existence of such Hi satisfying Theorems [T] and [9] was established by Sohler and 
Woodruff [2T]. Here, our goal is to show that Hi can be factored into structured matrices so that the 
product HiA can be computed in 0{ndlogd) time. We also remark that, in additional theoretical 
bounds provided by the FJLT, high-quality numerical implementations of variants of the Hadamard 
transform exist, which is an additional plus for our empirical evaluations of Theorem [T} 

3.2 Proof of Theorem [l] 

Preliminaries. Before presenting the proof, we describe the main idea. It follows a similar line 
of reasoning to |21] , and it uses an "uncertainty principle" (which we state as Lemma |4] below) . To 
prove the upper bound, we use the existence of a {d, l)-conditioned basis U and apply Hi to this 
basis to show that ||niC/x||]^ cannot expand too much, which in turn means that ||niAa;||]^ cannot 
expand too much (for any x). To prove the lower bound, we show that the inequality holds with 
exponentially high probability, for a particular y; and we then use a suitable 7-net to obtain the 
result for all y. 

Main Proof. We now proceed with the proof of Theorem [T| We will first prove an upper 
bound (Proposition [T| and then a lower bound (Proposition [2]) ; the theorem follows by combining 
Propositions [T] and [2] 

Proposition 1. With probability at least 1 — S, for all x G M'^, ||niAx||i < where k = 

0(^log(nd)). 

Proof. Let U S M"^"^ be a {d, l)-conditioned basis (see Definition [l] below) for the column space of 
A, which implies that for some Z G W^^'^ we can write A = UZ. Since ||ni^x||i < if and 

only if ||niC/Zx||i < k||?7Zx||i, it suffices to prove the proposition for U. By construction of U, for 
any x G M*^, ||x||oo < and so 

||niC/x||i < ||niC/||i||x||oo < ||niC/||i||c/x||i. 

Thus it is enough to show that ||niC/||i < k. We have 

||niC/||i = \\BCHU\\i = Y.WBCHU^^'^Wi = Y,\\BCU^^^\\i, 

ie[d] ie[d] 



where U = HU. We will need bounds for ||f/''-'^||i for j G [d], and For any vector y G 

T 



we represent y by its n/s blocks of size s, so Zi G and y^ = [zj , Z2 , . . . , z'^,]. Recall that 



Gs = and observe that ||Gs||2 = \/2. By explicit calculation, 

lli^ylli = E ll^-^^lli- 

is [n/s] 



Since < v 2s||Gs2;i||2 < v4s||zj||2 < v4s||2j||i, it follows that 

\\Hy\\i < E ||zj||i = \/4i||y||i. 

is [n/s] 
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Applying this to y = U^^^ for j G [d] yields 

||{7(j')||i < V4^||f/(^')||i, and ||C/||i = ^ HfX^'^Hi < V4i||f/||i < dv^, (1) 

i6[d] 

since \\U\\i < d because U is (d, l)-conditioned. 

The (i, j) entry of BCU is J2k ^ikCkkUkj, which is a Cauchy scaled by jij = \BiktJkj\- So, 

\\BCU\\,= Yl ^iAj^ 
ie[ri],je[d] 

where Cij are dependent Cauchy random variables. Using '^^Bik = 1, we obtain: 

= X] \^ikUkj\ = "^lUkjl^^Bik = ^ \Ukj\ = \\U\\i. 
i,j i,j,k j,k i j,k 



Hence, we can apply Lemma [l] with 7 = \\U\\i and m = rid to obtain 

Pr \\\BCU\\i > t\\Uh] < + (1 + 0(1)) . 

L J vrt 

Setting the RHS to 6, it suffices that t = log(dri). Thus, with probabilty at least 1 — 5, 



|i?CC/||i = 0( ^log(dri)||?7||i') = O (^V~slog{dri)] =o('^log{nd) 



□ 



Before we prove the lower bound, we need the following lemma which is derived using a spar- 
sity result for matrices with unit norm rows and low "coherence," as measured by the maximum 
magnitude of the inner product between distinct rows {Gg is a matrix with low coherence). This 
result mimics results in [9l |8l [131 ESI US] . 

Lemma 4. For G = ["^"j and any z G W , ||G2;||i > ^s-'-/^||2:||2. 

Proof. We can assume ||z||2 = 1, and so ||G2;||2 = 2. Let G(5/) be k rows of G, with k of them 
coming from Hg and k — k from Ig- G(5/-)G^,^ = / + A where A is a symmetric 2x2 block matrix 



-^Q^ 



75" 



where the entries in Q G M'^><('= «) are ±1, and so ||Q||2 < \/ i^{k — k) < \k. 
\\Gis')\\l = \\G,s')GTs42 < 1 + IIAII2 = 1 + ^IIQIb < 1 + ^ 



Now, given any z, we set k = 2/3-y/s with /3 = |, and choose G(^s') to be the rows corresponding to 
the k components of Gz having largest magnitude, with G(5) being the rows with indices in [s] \ S' . 
Then ||G(5')-2||2 < l + and so the entry in Gf^gi-^z with smallest magnitude has magnitude at most 
a = y/{l + p)/k = v^(l + /3)/2/3s-i/^. We now consider Since \\Gz\\l = 2, > 1-/3; 

further, all components have magnitude at most a (as all the components of G(^g^z have smaller 
magnitude than those of G(^s/-jz). ||G(5)z||i is minimized by concentrating all the entries into as 
few components as possible. Since the number of non-zero components is at least (1 — /3)/a^ = 
2/3s"^/^(l — /3)/(l + /3), giving these entries the maximum possible magnitude results in 

\\G^S)4i > a X = (1 - /3)v'2/3(l + /3)si/^ > Oms'^^ 

(where we used /3 = |). We are done because ||G^;||i > ||G(5)2:||i □ 
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We now prove the lower bound. We assume that Proposition [T] holds for Hi, which is true 
with probability at least 1 — 6 for k as defined in Proposition [TJ Then, by a union bound, both 
Propositions [T] and [2] hold with probability at least 1 — S — 2exp(— n + dlog{8dK)) {S and k are 
from Proposition [1]) . The final probabilty of failure is at most 5 + 2exp(— ri + (ilog(8d/t)) < 26 by 
choosing ri = ad log ^ for large enough a because k = log (rid)). 



Proposition 2. Assume Proposition 
probability at least 1 — 2exp(— ri + d. 



i] holds. Then, for all x G 
og{8dK)). 



\IliAx\\i > j\\Ax\\i holds with 



Proof. First we will show a result for fixed y G 
Lemma 5. Pr n|niy|L < i||y|U < e^''^ 



summarized in the next lemma. 



Given this lemma, the proposition follows by putting a 7-net T on the range of A (observe that 
the range of A has dimension at most d). This argument follows the same line as in Sections 3 
and 4 of |21j . Specifically, let L be any fixed (at most) d dimensional subspace of M."' (in our case, 
L is the range of A). Consider the 7-net on L with cubes of side ^/d. There are {2d/^)'^ such 
cubes required to cover the hyper-cube \\y\\^ < 1; and, for any two points 2/1,2/2 inside the same 
7/(i-cube, \\yi — y2||i < 7- From each of the ^/d-cuhes, select a fixed representative point which we 
will generically refer to as y*; select the representative to have ||y*||i = 1 if possible. By a union 
bound and Lemma [5} 



Pr 



min llHiy* 

y* 



\y 



li<5 



< {2d/^r 



We will thus condition on the high probability event that ||ni2/*||-^ > ||y*||-^/2 for all y* . For 
any y £ L with ||y||^ = 1, let y* denote the representative point for the cube in which y resides 
{\\y*\\i = 1 as well). Then \\y - y*\\ < 7. 



Iiniyii, = \\Uiy* + ni(y - y*)\\, > \\Uw*h " lini(y - ylh > 



irlll 



f^Wy-y 111, 



where the last inequality holds using Proposition [T] By choosing 7 = 1/(4k) and recalling that 
\\y*\\i = 1, we have that ||niy||-^ > with probability at least 1 — exp(— ri + d'[og{8dK)). 

All that remains is to prove Lemma [5] As in the proof of Proposition [T| we represent any vector 



y G M" by its n/s blocks of size s, so zi G M'^ and y 



T 



Gszi 



G. 



We have that H^Hg = HGsZjUg = 2^- ||zj||2 = 2||y||2, and 



Iff 



n/J- Letg = Hy, 



'-ill 2 



1/2 



^ 1/4|| II 



We conclude that \\g\\i > ^^■s ' ||ff||2) which intuitively means that g is 'spread out'. We now 
analyze ||ni?/||^ = HiJCgH]^ = Bji\Cii\\gi\. We proceed similarly to the proof of Lemma [2| Let 
Zi = min{|Cii|,M}, where M is such that B[Zi] = 1 (M « 1.6768); then B[Zf] < §. Clearly 



|niy||i > Y^ij BjiZi\gi\. Let X = Y^^- BjiZi\gi\ (note that j G [n]). 



B[x] = Y,nBji]nzi]\gi\ = \\9\ 
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(because the Bij and Zi are independent and E[i?jj] = 1/ri). To apply Lemma 10 we compute 



From Lemma 



10 



we have (setting t = ^\\g\\i), 

Pr[X < l\\g\\^] < e-^^Mll/Ml < g"^^'^' 
because 115111/115112 ^ and s = (96ri)^. The result follows because ||niy||-|^ > X. □ 

Running Time. The running time follows from the time to compute the product HgX for a 
Hadamard matrix Hs, which is O(slogs) time. The time to compute Hy is dominated by n/s 
computations of HsZi, which is a total of 0(^ • slogs) = O(nlogs) time. Since C is diagonal, 
pre-multiplying by C is 0(n) and further pre-multiplying by B takes time 0{nnz{B)), the number 
of non-zero elements in B (which is 2n). Thus the total time is 0(nlogs + n) = 0(n log ri) as 
desired. 

4 Algorithmic Applications of the Fast Cauchy Transform 

In this section, we describe three related applications of the FCT. The first is to the fast construction 
of an ^i-well-conditioned basis and the fast approximation of ^i-leverage scores; the second is to a 
fast algorithm for the least absolute deviations or ^i-regression problem; and the third is to a fast 
algorithm for ^i-norm subspace approximation. 

4.1 Fast construction of an £i-well-conditioned basis and £i-leverage scores 

We start with the following definition of a basis that is "good" for the ii norm in a manner that is 
analogous to how an orthogonal matrix is "good" for the £2 norm. 

Definition 1 (Well-Conditioned Basis (adapted from [7J)). A basis U for the range of A is (a,/3)- 
conditionedif\\U\\i < a andforallx G , \\x\\^ < /3||f7x||i. We will say that U is well- conditioned 
if a and f3 are low-degree polynomials in d, independent of n. 

Remark. An Auerbach basis for A is (d, l)-conditioned, and thus we know that there exist well- 
conditioned bases for ^1. More generally, well-conditioned bases can be defined in any -^^-norm, 
using the notion of a dual norm £*, and these have proven important for solving d.^ regression 
problems [7j. Our focus here is the ^i-norm, for which the dual norm is the ^oo-norm. 

Our main algorithm for constructing an ^i-well-conditioned basis is from [21] and is summarized 
in Figure [1} Let Hi G ^''i^" be any projection matrix such that for any x G W^, 

\\Ax\\^ < WHiAxW-^ < k\\Ax\\-^. (2) 

For example, it could be constructed with either of the FCT constructions described in Section |3j 
although it need not be. Then, the algorithm of Figure [T] consists of the following steps: first, 
construct and an R such that IIi^ = QR^ where Q has orthonormal columns (for example 
using a QR-factorization of LLi^); and then return U = AR~^ = A[Q^IliA)~~^ . The next theorem 
directly follows by combining our Theorems [o] and [l] with Theorems 9 and 10 of ^21j . 
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Fast-£i-Basis(A): 
1: Let Hi be an ri x n matrix satisfying ([2| 



Compute UiA e IR'^i^'' and its QR-factorization: UiA = QR with Q^Q = /. 
Return [/ = = A(Q'^niA)-i 



Figure 1: Fast construction of an ^i-well conditioned basis from 

Theorem 2. For any A G M"^'^, the basis U = AR~^ constructed in Figure [7] using any Hi 
satisfying ^ is a (d^/ri, k)- conditioned basis for the range of A. If Hi in Figure^is obtained 
from Theorem^ the resulting U is an {a, (3) -conditioned basis for A, with a = 0((i^/^ log^^^ and 
/3 = 0{d'^log'^ d), and the time to compute the change of basis matrix is 0{nd\ogd-\- d^logd). 

Proof. Clearly U = AR^^ is in the range of A and has the same null-space otherwise IIi^ would 
not preserve lengths to relative error. Therefore C/ is a basis for the range of A. Consider any 
G R'^. The first claim of the theorem follows from the following derivations: 

(«) .. .. _.. -, .. (b) 



|C/||i = \\AR-^\\^ < \\I{iAR-^\\-^<^\\]liAR-^\\^ = d^;a.nd 



I lloo 



< ||x||2 = ||niAi?"^a;||2 < \\IiiAR'''^x\\^<K\\AR''^x\\-^ = k\\Ux\\^. 



(a) follows from the lower bound in ([2]), because it holds for every column of AR~^; (b) follows 
because by the construction of R, Ili^i?"^ has d orthonormal columns; finally, (c) follows from the 
upper bound in ([2]). 

Finally, if Hi satisfying ([2]) is constructed using Theorem [l] with small fixed probability of failure 
5, then ri = 0{d\ogd) and k = 0{d^log^ d). The running time to compute R^^ is obtained by 
summing 0{ndlogd) (to compute Hiyl) and 0{rid'^) = 0{d'^logd) (to obtain R^^ £ M'^^'^). □ 



Remark. Our constructions that result in Hi satisfying ([2]) do not require that A G M"^*^; they 
only require that A have rank d, and so can be applied to any A £ Jg^x*^ having rank d. A small 
modification is needed in construction of U because R G M'^^ '" and so we need to use R^ instead of 
R^^. The running time will involve terms with m. This can be improved by processing A quickly 
into a smaller matrix by sampling columns so that the range is preserved (as in [21]). which we do 
not discuss further. 

The notion of a well-conditioned basis plays an important role in our subsequent algorithms. 
Basically, the reason is that these algorithms compute approximate answers to the problems of 
interest (either the £i-regression problem or the £i-subspace approximation problem) by using 
information in that basis to construct a nonuniform importance sampling distribution with which 
to randomly sample. This motivates the following definition. 

Definition 2 (£i-leverage scores). Given a well- conditioned basis U for the range of A, let the 
n-dimensional vector A, with elements defined as Aj = median^ be the ii-leverage scores of A. 

Remark. The name ii-leverage score is by analogy with the £2-leverage scores, which are important 
in random sampling algorithms for £2 regression and low-rank matrix approximation |18[ [TU] |^ As 
with £2 regression and low-rank matrix approximation, our result for ii regression and ii subspace 
approximation will ultimately follow from the ability to approximate these scores quickly. 



^Note that these scores are not well-defined for a given matrix A, in the sense that the £1 norm is not rotationally 
invariant, whereas for £2 any orthogonal matrix spanning a given subspace leads to the same ^2-leverage scores. We 
will tolerate this ambiguity since these scores will be used to construct an importance sampling distribution, and thus 
up to low-degree polynomial factors in d, which our analysis will take into account, it will not matter. 
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4.2 Fast ii regression 

Next, we consider the ii regression problem, also known as the least absolute deviations problem, 
the goal of which is to minimize the £i norm of the residual vector Ax — b. That is, given as input 
a matrix A G M"^'^, with n > d, and a target vector b G M"", compute 

Z = min ||6 — (3) 

and an x* achieving this minimum. We start with our main algorithm and theorem for this 
problem; we then describe how a somewhat more sophisticated version of the algorithm yields 
improved running time bounds; and, we finally describe an extension of the problem to include 
multiple right hand sides. 

4.2.1 Our main algorithm and result for fast £i regression 

Prior work has shown that there is a diagonal sampling matrix D with a small number of nonzero 
entries so that x = aigmm^^^d\\D{Ax — b)\\i satisfies 

\\Ax-b\\i < {1 + e)\\Ax* - b\\i, 

where x* is the optimal solution to Problem ([s]) ; see [TJ |2T] . The matrix D can be found by sampling 
its diagonal entries independently according to a set of probabilities pi that are proportional to the 
£i-leverage scores. Here, we give a fast algorithm to compute estimates pi of these probabilities. 
This permits to develop an improved algorithm for £i regression and to construct efficiently a small 
coreset for an arbitrary ii regression problem. 

Figure [2] presents the details of our FastCauchyRegression algorithm, which we summarize here. 
Let X = [A —6] . First a matrix Hi satisfying ^ is used to reduce the dimensionality of X to 
UiX and to obtain the orthogonalizer R^^. Let U = XR^^ be the resulting well conditioned basis 
for the range of X. The probabilities we use to sample rows are essentially the row-norms of U. 
However, to explicitly compute XR~^ takes 0{ncP) time which is already too costly, so we need 
to estimate without explicitly computing U. Thus, to construct these probabilities quickly, 

we use a second random projection II2 on the right. This second projection, allows us to estimate 
the norms of the rows of XR~^ efficiently to within relative error, which is all we need. (Note that 
this is similar to what was done in [1Q\.) These probabilities are then used to construct a carefully 
down-sampled (and rescaled) problem the solution to which will give us our (1 -|- e) approximation 
to the original problem. 

The next theorem summarizes our main quality-of-approximation results for the FastCauchyRe- 
gression algorithm. It improves the 0{n(P -\- poly (de~^ log n)) algorithm of [21], which in turn 
improved the result in [7]J^ Our improved running time comes from using the FCT and a simple 
row-norm estimator for the row-norms of a well-conditioned basis. 

Theorem 3 (Fast Cauchy ^i-Regression). Given are e G (0,1), p > 0, A e M"^'^ and b G M". 
FastCauchyRegression(j4, 6) constructs a coreset specified by the diagonal sampling matrix D and a 
solution vector X G that minimizes the weighted regression objective \\D{Ax — b)\\-Y- The solution 
X satisfies, with probability at least 1 — 

■^Technically, their running time is 0(nd'^^ ^^+poly(de~^ log n)), where oj"*" is any constant larger than the exponent 
for matrix multiplication. For practical purposes, we can set lo^ = 3. 
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FastCauchyRegression(A, b): 

1: Let X = [A —b] G ]]jnx(d+/c) g^j^jj construct IIi, an ri x n matrix satisfying ^ 

with A replaced by X. (If & is a vector then A; = 1.) 
2: Compute X' = IliX e M'-i x (''+'=) , and its QR-factorization, UiX = QR together 
with . (Note that R"^ is such that I{iXR~^ has orthonormal columns 
Let n2 G k('^+'^)>^'"2 be a matrix of independent Cauchys, with r2 = 15 log 
Let U = XR-^ and construct A = [/Ha G M"''''^ 
For i e [n], compute = medianjgr2 
6: For z e [n] and s = ^) 
probabilities 



2n 
5 ■ 



4(i^/ri^ max( (d+A:) ^/r7,K:) 



log I ) , compute 



Pi = mm ■ 



7: Let D e 



be diagonal with independent entries: Da 



j: prob. p,; 



prob. 1 — Pi. 

8: Return £ e M'^ that minimizes ||_Dyla; — Db\\^ w.r.t. x (using linear programming). 



Figure 2: Our ^i-regression algorithm. We remark that in Step 6, we are sampling rows of A 
and b so that the expected number of rows sampled is at most s. Instead of this version of 
independent sampling (without replacement), we could also sample exactly s rows independently 
with replacement according to the probabilities pi = Ai/Eie[n] and all our results continue to 
hold up to small factors. 

Further, with probability 1 — o(l), the entire algorithm to construct x, runs in time 
0{nd\ogn + (t){S,d)) = O [ndlogn + ^ poly (d, log ^)) , 

where S = O ^^d^'^s log2 (^)^ and (j){S,d) is the time to solve an ii-regression problem on S 
vectors in d dimensions. 

Remarks. Several remarks about our results for the ii regression problem are in order. 

• Our methods extend to £p regression for p > 1, as discussed in ^ 

• Second, we can further improve the efficiency of solving ii regression, thereby replacing the 
ndlogn term in Theorem [s] with ndlog{d£~^ logn), but at the expense of a slightly larger S. 
The improved algorithm is essentially the same as FastCauchy Regression with two differences: 
112 is chosen to be a matrices of i.i.d. Gaussians, for a value r2 = 0{log{de~^ logn)); and, to 
accomodate this, the size of s needs to be increased. Details are presented in Appendix [Ej 

• Third, ii regression can be used to solve ii hyperplane approximation. A natural extension of 
our algorithm to matrix valued b gives a (1 + e) approxination for £i subspace approximation 
in a similar running time. See Section [4.3| for details. 

4.2.2 Proof of Theorem M 

For X £ M"^"?, we analyze the more general constrained ii regression problem, min^jgc and 
we show that x £ C constructd by our algorithm is a (1 + e)-approximation for this more general 
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problem: 

WXxL < (1 + e)min||Xx*|L. 

(That is, we actually prove a somewhat stronger result than we state in Theorem |3j This more 
general problem involves calling our main algorithm with b = {} (NULL), and then incorporating 
the constraint that x £ C into the optimization problem that is solved as a black box in the last step. 
Of course, if the constraint set is not a polytope, then the last step may involve more sophisticated 
techniques than linear programming.) The classic £i regression is obtained by setting X = — 
{q = d + 1) with constraint C = {x : e^j^-^x = 1}, which corresponds to setting Xd+i = 1. 

The main ingredients in our proof follow a similar line to those in [6l[7l[2I]. We use the notation 
in Figure [2| From Step 1, Hi satisfies ^ with A <^ X, i.e. Hi preserves the ^i-norm of vectors in 
the range of X: 

\\Xx\\-^ < \\UiXx\\^ < k||Xx||i. (4) 

Let C = {y = R~^x : x G C} be a linear transform of the constraint set. We start with a basic 
lemma that allow us to use U instead of X. This lemma says that if we can construct a sampling 
matrix D for U under the constraint C such that solving the down-sampled problem for U gives a 
(1 + e)-approximation, then that same sampling matrix works for X under the constraint C. 

Lemma 6. Let U = XR^^ and D any diagonal sampling matrix as in Figure^ Suppose that for 
any y £ C that minimizes \\DUy\\, y is a (l + e)- approximation for the problem min^gc/ ||C/y||. Let x 
be any solution to minxec \\DXx\\. Then x is a (l + e) -approximation for the problem min^gc 

Proof. Select y = R^^x G C . For any y G C , there is some x £ C with y = R^^x, and we have: 

(a) 

\\DUy\\^ = \\DUR-^x\\^ = \\DXx\\^>\\DXx\\^ = \\DUR-^x\\^ = \\DUy\\^, 

where (a) is by the optimality of x. So, y minimizes ||-DC/y||-|^, hence for all y G C, < 
(1 + e)||C/y||;^. Now consider any x G C and let y = R~^x G C . Then, 

= \\UR-^x\\^ = \\Uy\\^ < (l + e)||;7y||i = \\UR-^x\\^ = \\Xx\\^. 

□ 

Remarks. We emphasize that our proof accomodates an arbitrary constraint set C. For the 
classical li regression problem, that is of interest to us in Theorem [3| we only need C to be 
specified by a single linear constraint. In the remaining, we will work with U and show that our 
algorithm generates a coreset that works, regardless of the constraint set C 

By Theorem [2| U = XR~^ is an (a, /3)-conditioned basis for the range of X, where 

^ Q\/^, and (3 < n. 

So, \\U\\^ < q^/ri and for all y G W^, ||y|loo — '^ll^ylli- Jiext show that Aj estimates ||?7(j)||^. The 
following lemma is a straightforward application of a Chernoff bound to independent half-Cauchys 
(see also Claims 1 and 2 and Lemmas 1 and 2 in [15j). 

Lemma 7. Let Zi, . . . , be r2 independent Cauchys. Then, ^ < median{|Zi|, . . . , iZ^jl} < | 
with probability at least 1 — 2e~'^'^'^, where c > 2(tan~^(g))^ > 0.07 is a constant. 
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Fix i and for j G [r2] define the random variables Zj = Ay- to apply Lemma ^ Observe that for 
j € [r2], Zj = Aij = [7(^)112 are independent Cauchy random variables scaled by Applying 
Lemma [7] with Aj = medianjg^j l^ijl; have that with probability at least 1 — 2e~^'^'^ 

By a union bound, these inequalities hold for all i £ [n] with probability at least 1 — 2ne~^''^ > 1 — S 
for r2 > Mog ^ (since ^ < 15, our algorithm satisfies this condition). Next we show that if the 
sampling matrix preserves the ^i-norm of every vector in the range of U, then we are done. 

Lemma 8. Given D with n columns, Suppose that for all y £ M^, 

{l-e)\\Uy\\,<\\DUy\\,<{l + e)\\Uy\\„ (5) 
and suppose that y is a solution to miny^c' \\DUy\\. Then, for all y £ C , 

Proof. Since D preserves norms, for any y £ C we have that: 



< \\DUy\\i<, \\DUy\\, < —\\Uy\\,. 

I — e 1 — e 1 — e 

(a) is by the optimality of y. □ 

The remainder of the proof is to show that D from our algorithm in Figure [2] satisfies the pre- 
condition of Lemma [sj namely ([s]). We need two ingredients. The first is the ^i-sampling lemma 
Lemma [Sj The second ingredient is a standard 7-net argument. 

We are going to apply Lemma [3] with Z = U, for which we constructed Aj such that, with 
probability at least 1 — 5, ^ > g. Since U is (a, /3)-conditioned, > ;g||y|loo' ^^'^ ll^lli ^ 

and so we have that with probability at least 1 — S, 

{l-e)\\Uy\\,<\\DUy\\,<{l + e)\\Uy\\„ 

where 5 < 2 exp ^ (g^2e)Qff ) ' — '^iV^- If y = then the bounds trivially hold; by 

rescaling, it thus suffices to show the bound for all y G M'^ with \\y\\^ = 1- We now show this 
using a standard 7-net argument. Consider the uniform lattice on M'^ specified by T = 2^'^. Let 
H = {z : \\z\\^ = 1} n T be the restriction of this grid to only its points with ^oo-norm equal to 1; 
l-f^l ^ (tt)*^- Consider any y with \\y\\^ = 1, and let h be the closest grid point in H to y. Then 



y 



h+-Y.C^e^, (6) 



q 



i=l 



where < \Q\ < 1. Observe that ej G H. By a union bound, for every h G H, with probability at 
least 1 — S, 

{l-e)\\Uh\\, < \\DUh\\, < {l + e)\\Uh\\„ 
where S < 2|T| exp ^ (Q+2e)ai3 ) • condition on this high probability event. Then, 



\\DUy\\ 



DUh +-Y^ CiDUer 
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Applying [/ to both sides of (joj) and using the triangle inequality, < X]i=i |Ci| ll^^dli- 

We conclude that 



\\DUy\\, < (1 + e) [\\Uy\\, + ^ \\Ue^\,^ < (1 + e)\\Uy\\, (l + ^ j 

where we used \\Uy\\^ > ^||y||oo = ^ (since \\y\\^ = 1) and I]^=i ||t^ei||^ = \\U\\-^ < a. In an 
analogous way, we get the lower bound: 



\\DUy\\, 



DUh + - V CiDUe^ 

n ^ — ' 



i=l 



> \\DUh\\,-^^\Ci\\\DUei\\, 



> 



[l-e)[\\Uh\\,-lY.\\^^^\\i 



1=1 



Again, applying [7 to (p| and using the triangle inequaltiy gives > ^ Yli=i ICilllf^^illi- 

Further, ||f/y||]^ < ||C7|l7l|y||oo < a, and so we have 



\\DUy\\,>{l-e) 



27 



1=1 / 

Setting 7 = ge/(2amax(a,/3)), using (1 + e)^ < 1 + 3e and (1 — e)^ > 1 — 3e (for e < 1), and 
rescaling e by dividing by 3, we obtain that with probability at least 1 — S, 

{l-e)\\Uy\\,<\\DUy\\,<{l + e)\\Uy\\,. 

where 5 < 2\T\ exp ^ 9(64^^/3)0.^ ^ > and \T\ < ( ^° ™^i<^,P) ^g^ Solving for s using a < dyjr\ and /3 < k, 
and simplifying a little, we require 

%'iKqJT\ ( 4gA/ri max(g^/ri, k) 2 
s > ^2 ( 5 log ^ ^ log ^ 

The total success probability is 1 — 2(5, which results from a union bound applied to the two random 
processes involving 112 and D. This concludes the proof of the correctness. 



Running Time Set b = for p = 0(1). We compute the running time as follows. In Step 2, 
if we use Theorem [o] for Hi (which succeeds with probability 1 — 5), the time to compute IIiX is 
0{nqlogq) and ri = 0{qlogq) and k = O [q^'^'^ log q)) (ri and k affect the running time of later 
steps); We need to compute an orthogonal factorization in O(rig^) and then compute in 0{q^) 
for a total run time of Step 2 that is 0((n + q^)qlogq). In Step 3, r2 = O(logn) so the time to 
compute A = XR~^Il2 is in 0{nqr2 + ^2^^) = 0(ng logn + q^logq). Since computation of the 
median of r2 elements is in 0{r2), computing all Aj takes 0{nr2) = O(nlogn) time. Thus, the 
running time for Steps 1-5 is 0(ng log n) + q^ logq. 

Note that k > {q)y/ri, so in Step 6, s = O ^e^^g'^^a log^(|)^ It takes 0{n) time to sample 

the diagonal matrix D and then 0{qS) time to construct DA and -D6, where S is the number of 
non-zero entries in D. Lastly Step 8 takes (j){S,d) = Q{dS) time to solve the ii regression on the 
smaller problem with s rows in d dimensions. The total running time is thus: 

O {nqlogn + (j){S, q)) . 
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where E[S] <s = [(-'^qP+-2 log2(f )j and S is very tightly concentrated around s (via a stan- 
dard Bernstein bound) because it is the sum of independent binomial variables; specifically, with 

3 

probability at least 1 — e~8*, S < 2s. Hence S = 0{s) with probability 1 — o(l). The probability 
of success is 1 — 35 = 1 — ^ (union bound). Since s = e~^g'''^2 poly(log |), and since standard 

algorithms for linear programming give 4>{S, q) = Sq^^^\ we have the result claimed in the theorem 
for q = 0{d). 

4.3 ^i-norm subspace approximation 

Finally, we consider the ii-norm Subspace Approximation Problem: Given the n points in the nx d 
matrix A and a parameter k £ [d — 1], embed these points into a subspace of dimension k to obtain 
the embedded points A such that ||A — is minimized. (This is the ii analog of the £2 problem 
that is solved by the Singular Value Decomposition.) When k = d — 1, the subspace is a hyperplane, 
and the task is to find the hyperplane passing through the origin so as to minimize the sum of ii 
distances of the points to the hyperplane. The crucial observation made in [5] (see also Lemma 
18 of [21]) is that this problem can be reduced to d t\ regressions of A onto each of its columns 
(multiple regression). 

We can extend our l\ simple regression algorithm to i\ multiple regression (see the Appendix) 
from which we obtain our result for subspace approximation. 

To see how to address this problem, consider the following £i-regression problem: 

min \Aw — J^^^ 

w.Wj=0 

This regression problem is fitting (in the ^i-norm) the jth. column of A onto the remaining columns. 
Let Wj be the optimal solution. Then if we replace A^^') by Awj, the resulting vectors will all be 
in a d — 1 dimensional subspace. Let Aj be A with A^^^ replaced by Awj. The crucial observation 
made in [5] (see also Lemma 18 of [21]) is that one of the Aj is is optimal — and so the optimal 
subspace can be obtained by simply doing a hyperplane fit to the embedded points. So, 

min II A — |L = min ||^ — A||-^. 

j(^[d] rank(A)=d-l 

When viewed from this perspective, the £i-norm subspace approximation problem makes the con- 
nection between low-rank matrix approximation and overconstrained ii regression; a similar ap- 
proach was used in the £2 case to obtain relative-error low-rank CX and CUR matrix decompo- 
sitions [11]. We thus need to perform k constrained regressions, which can be formulated into a 
single constrained multiple regression problem, which can be solved as follows: Find the matrix W 
that solves: 

min \\AWL, 
w&c ^ 

where the constraint set is C = {W G R'^^'^ : Wa = — 1}. Since the constraint set effectively places 
an independent constraint on each column of W, after some elementary manipulation, it is easy 
to see that this regression is equivalent to the d individual regressions to obtain w*. Indeed, for 
an optimal solution W* , we can set Wj = W*^^\ Note that this constrained ii multiple regression 
is exactly the type of optimization problem studied in the proof of Theorem [3j for which we have 
already developed a coreset-based algorithm. 
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4.3.1 Generalizing li regression to multiple regression 

The multiple li regression problem is similar to the vanilla ii regression problem, except that 
it involves solving for multiple right hand sides, i.e., both x and h become matrices {W and 
respectively). Specifically, let A £ M"^'^ and B G M"^'=. We wish to find W £ M^^'^ which minimizes 

\\AW-B\\^. 

It is easy to see that the optimal W can be obtained by solving k separate simple regressions, with 
b = for j G [k]. But one can do better. As we did with simple regression, we can reformulate 
the more general constrained optimization problem: 

min llXZlli . 
Zee ^ 



To recover multiple ii regression, we set 

X =[A -B] ; Z 



W' 
h 



So the constraint set isC = {.Z^ = [^] : W £ M°'^'^|. A closer inspection of the proof of Theorem [s] 



in Section 4.2.2 reveals that nowhere is it necessary that x be a vector. So the whole proof generalizes 
to a matrix Z. Specifically, Q continues to hold since if it holds for every vector x, then it must 
hold for a matrix Z because 

ll^^lli = Eie[fe] 11^^^^^ 111- Similarly, if Lemma g continues to hold for 
vectors then it will imply the desired result for matrices, and so the only change in all the algorithms 
and results is that the short dimension of X changes from d + 1 to d + k. Further, by shrinking 
6 by an additional factor of k, and taking a union bound we get a relative error approximation 
for each individual regression. We refer to this modified algorithm, where a matrix B is input and 
the optimization problem in the last step is modified appropriately, as FastCauchy Regression (yl, B), 
overloading notation in the obvious way. The conclusion is summarized in the following theorem. 

Theorem 4 (Fast Cauchy Multiple ii Regression). Given e G (0,1), p > 0, a matrix A G M"^'' 
and B G M"^'^, FastCauchyRegression(yl, i?) constructs a coreset specified by the diagonal sampling 
matrix D and a solution vector W G M'^ that minimizes the weighted multiple regression objective 
\\D{Ax — b)\\i. The solution W satisfies, with probability at least 1 — pq^^? 

< (y^) \\Ax-B^^^\\^, Vx gM^ and^j G [k]. 

Further, with probability 1 — o(l), the entire algorithm to construct W , runs in time 

O {n{d + k) log n + 4>{S, d,k)) , 

where S = O ^^(d + A;)^^^ log2(^^)^ and (j){S,d,k) is the time to solve k £i-regression problem 
on the same S vectors in d dimensions. 

To solve the £i-subspace approximation problem, we will use this version of £i multiple regression, 
which is more efficient that solving k separate £i-simple regressions. 

Remark. We can save an extra factor of {d + A;) in S" in the above theorem if all we want is a 
relative error approximation to the entire multiple regression and we do not need relative error 
approximations to each individual regression. 
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Remark. When k = 0(d) it is interesting that there is essentially no asymptotic overhead in 
solving this problem other than the increase from (p{S, d) to 0(5', d, k); in general, by preprocessing 
the matrix DA, solving k regressions on this same matrix DA is much quicker that solving k 
separate regressions 

4.3.2 Application to ^i-norm subspace approximation 

Using our approximation algorithm for constrained multiple ii regression, we can build an approx- 
imation algorithm for the £i-norm subspace approximation problem. Our algorithm improves upon 
the previous algorithm in ^Tj, which was Qlnd'^^ + poly{de~^ logn)), where lo « 2.376 and /3 > 
is any constant. The algorithm is basically our multiple £i regression algorithm, FastCauchy Regres- 
sion invoked with A and b = {} (NULL). The algorithm proceeds exactly as in Figure [2| except for 
the last step, which instead uses linear programming to solve for W that minimizes HAH^H^ with 
respect to W G C (the constraints defining C are very simple affine equality constraints). Given 
W, we define wj = W^^^ and compute j* = argmin^gj^j \\A — Aj\\ where Aj is A with the column 
j^U) replaced by Awj. It is easy to now show that Aj* is a (1 -|- e)-approximation to the d — 1 
dimensional subspace approximation problem. Indeed, recall that W* is optimal and the optimal 
error is ||;^ for some j £ [d]; however, for any j £ [d]: 

where (a) is from the (1 -|- e)-optimality of the constrained multiple regression as analyzed in 



Section 4.2.2 and (b) is because j* attained minimum error among all j £ [d]. This discussion is 



summarized in the following theorem. 

Theorem 5. Given A E W^^'^ (n points in d dimensions), there is a randomized algorithm which 
outputs a (1 + e)- approximation to the ii-norm subspace approximation problem for these n points 
with probability at least 1 — Further, the runing time, with probability 1 — o(l), is 

O (ndlogn-F ^poly(d,logf)) . 

5 Extension to £p for p > 1 

We give a second approach based on the FJLT that can extend to p > 1. This approach for 
constructing the FCT is spelled out in detail for £i in the Appendix. Here we show how to quickly 
obtain a well-conditioned basis for the ^p-norm, for any p G [l,oo). 

5.1 Well-conditioned bases 

Let s = Q{d + logn) and t = Q{d^). Consider the matrix F G M"*/*^", which is a block-diagonal 
matrix comprising n/t blocks along the diagonal. Each block is the sxt Fast Johnson Lindenstrauss 
matrix G. Here, for simplicity, we assume that ns/t is an integer. 



F = 



G 

G 



G 



^Compare with £2 regression where solving k regressions with the same A takes 0{ncP + ndk -\- kcP) (since the 
SVD of A needs to be done only once) versus a time of 0{nkcf) for k separate regressions. 
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We will need the following theorem. 



Theorem 6 (Theorem 5 of [7], restated). Let A be an n x d matrix, and let p G [l,oo). Then 
there exists an {a, /3 , p)-weU-conditioned basis for the column space of A such that if p < 2, then 
a = and (3 = 1; if p = 2, then a = d^/^ and /3 = 1, and if p > 2 then a = d^/^+i/p and 

(3 = d^^'^~^^P. The well-conditioned basis can be computed in 0{nd^ logn) time. 

The specific conditions satisfied by a well-conditioned basis are given (and used) in the proof of the 
theorem below. 

Consider the following algorithm Condition(^) given a fixed matrix A G M"^*^: 

1. Compute FA; 

2. Apply Theorem [6] to FA to obtain a d x d change of basis matrix U so that FAU is an 
(a, /3,p)-well-conditioned basis of the column space of matrix FA; 

3. Output AU/d-fp, where 7p = y/2t^l'P-'^l'^ for p < 2, and 7p = y/2s^l'^~^/P for p>2. 

The following is our main theorem in this subsection, which improves the 0{nd^\ogn) time of 
the algorithm of [7|, at the cost of slightly worse a and /3 factors. However, these worse factors 
will only contribute to a low-order additive poly(d) term in the running time of our £p-regression 
application. 

Theorem 7 (£p-Well Conditioned Basis). With probability 1—1/n, the output AU/djp o/ Condition (A) 

is an {a, l3y/?,d{st)^^/'P~^l'^\p)-well-conditioned basis, that is, an {a, (30{d^),p)-well-conditioned ba- 
sis. The time to compute U is 0{ndlogn). 

To prove the theorem, we will need the following lemma, which relates to how an arbitrary 
subspace L behaves under the action of G, extending Lemma [TT] to every x G L, not just a fixed x. 



The 2-norm bound can be derived using Lemma 11 (above) and Theorem 19 of [T7] by placing a 



7-net on L and bounding the size of this 7-net (See Lemma 4 of [3].) The Manhattan norm bound 
is then derived using a second 7-net argument together with an application of the 2-norm bound. 
We present the complete proof of Lemma [9] in Appendix [Cj 

Lemma 9. Let L be any (fixed) d dimensional subspace o/M*, and G an sxt matrix sampled from 
a distribution having the mJLP property. Given e E (0, g], l&t s = 36(A; + ^ + log(2ci))/c2e 
0(^ + ^)- Then, with probability at least 1 — e~^ , for every x G M*, 

Vl — f'\\x\\2 < ||Gx||2 < \/T+~e||x||2 
C3^(l - e)||x||2 < \\Gx\\^ < C3y/s{l + e)||x||2 



The constants ci, C2 and C3 in this lemma are from Definition [3j and the G in Lemmas 11 and[9j 



with the constants ci, C2 and C3 from Definition [sj is the same G used in our Construction I for H. 
The proof is given in Appendix [C) 

Proof, (of Theorem [Tj) Applying Lemma [ol we have that with probability at least 1 — 1/n, for all 
X G M'', if we consider y = Ax and write y = [zf , z^, . . . , z'^^^], then for all i E [n/t], 

2 1 1 ^2 1 1 2 — II 1 1 2 — v 2 1 1 1 1 2 
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By relating the 2- norm and the p-norm, for 1 < p < 2, we have 

and similarly, 
If p > 2, then 
and similarly, 



||GZ.||^ > > > ^itV2-l/p||,^||^ 

||Gz,||, < \\Gz,\\, < Jl\\z,\\, < Jit^'-'/^z^l, 



WGZ^I > s'/^-'/'\\Gz,\\, > .Vp-1/2^1||,^||^ > ,1/p-1/2^1 ||,^||^. 

Since \\Ax\\^ = \\y\\^ = J2i W^iW^ and = \\Gzi\\^, for p G [1,2] we have with probability 

1 - 1/n 

and for p G [2, 00) with probability 1 — 1/n 



In either case, 

P^ll^ < jp\\FAx\\^ < V^{stf/^-'/'\ \\Axl. (7) 
Applying Theorem |6j we have, from the definition of a (a, /3,p)-well-conditioned basis, that 

\\FAU\\p < a (8) 

and for all x G M'^, 

\\x\\g < PWFAUWp. (9) 
Combining ([T]) and ([s]), we have that with probability at least 1 — 1/n, 
\\AU/djp\\p < ^ \\AUi/djp\\p < \\FAUi/d\\p < a. 



Combining ([7|) and (joj), we have that with probability at least 1 — 1/n, for all x G M , 

\\x\L < l3\\FAUx\L < (3V3d(stf/P-^/'^\\\AU^x\L. 

dip 

Hence AU/djp is an (a, /3\/3c?(st)'"^''^~^''^' ,p)-well-conditioned basis. The time to compute FA is 
0{ndlogd), and since FA is 0{n/d'^) x d, the time to compute U from FA is 0{ndlogn). □ 
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5.2 £p Regression 



As with ii , a well-conditioned basis can be used in the more general ip setting to solve ip regression 
problems, via an algorithm based on sampling the rows of A with probabilities proportional to the 
norms of the rows of the corresponding well-conditioned basis. As in j ]4.2[ this entails using for 
speed a second projection 112 applied to AU on the right, as for FastCauchy Regression, to estimate 
the row norms. This allows fast estimation of the £2 norms of the rows of AU; however, we need 
the ip norms of those rows, which we thus know up to a factor of dli/2-i/p|. We uses these norm 
estimates in the sampling algorithm of [7J; as discussed for the running time bound of that paper. 
Theorem 4.1, this algorithm samples a number of rows proportional to (a/?)^, when an (a,/3,p)- 
well-conditioned basis is available. Thus factor, together with a sample complexity increase of 
^p|i/2-i/p| _ ^\p/2-i\ j^ggjgjj compensate for error due to using 112, gives a sample complexity 
increase for our algorithm over that of f7] of a factor of 

while the leading term in the complexity (for n ^ d) is reduced from 0{nd^ logn) to 0{ndlogn). 
We paraphrase and adjust Theorem 4.1 of [7] to obtain the following. 

Theorem 8. Given e G (0, 1), A G M"^'^ and b G M", there is a sampling algorithm for ip regression 
that constructs a coreset specified by a diagonal sampling matrix D, and a solution vector x G M'^ that 
minimizes the weighted regression objective \\D{Ax — The solution x satisfies, with probability 

at least 1/2, the relative error bound that \\Ax - < (1 + e)\\Ax - 6||p for all x G R'^. Further, 
with probability 1 — o(l), the entire algorithm to construct x runs in time 

0{ndlogn + (j)p{S,d)) = O (nc^logn + ^ poly (d, log f)) , 

where S = 0{e^'^d'' log{l/e)) with k = p + 7\p/2 — 1\ + max{p/2 + l,p}, and(j)p{S,d) is the time to 
solve an ip-regression problem on S vectors in d dimensions. 



6 Numerical implementation and empirical evaluation 

In this section, we describe the results of our empirical evaluation. We have implemented and 
evaluated the Fast Cauchy Transforms I & II (FCTl & FCT2) as well as the Cauchy transform 
(CT) of f2T]. For completeness, we also compare our method against two ^2-based transforms: the 
Gaussian Transform (GT) and a version of the FJLT. Ideally, the evaluation would be based on 
the evaluating the distortion of the embedding, i.e., evaluating the smallest k such that 

\\Ax\\i < ||n^x||i < Vx G M"', 

where 11 G M'"^" is one of the Cauchy transforms. Due to the non-convexity, there seems not to be 
a way to compute, tractably and accurately, the value of this k. Instead, we evaluate both £i-based 
transforms (CT, FCTl, and FCT2) and ^2-based transforms (GT and FJLT) based on how they 
perform in computing well-conditioned bases and approximating ii regression problems. 



6.1 Evaluating the quahty of i'l-well-conditioned bases 



We first describe our methodology. Given a "tall and skinny" matrix A G M"^*^ with full column 
rank, as in Section 4.1 , we compute well-conditioned bases oi A: U = A{Q^IIA)~^ , where II is one 



of those transforms, and where Q is the Q matrix from the QR decomposition of 11^. Suppose that 



19 





time 


Kl 


CT 
FCTl 
FCT2 


0{nd log 
0{nd log 


id) 
d) 
d) 


0((i4 log3/2 d) 

o{dy^ iog5/2 d) 


GT 
FJLT 


0{nd^ 
0{nd log 


) 

n) 


0(ni/2d) 
0(ni/2d) 



Table 1: Time complexity and £i-conditioning performance for £i-based and ^2-based transforms. 

U is (a, /3)-conditioned, for some values of a and /3. Our empirical evaluation is based on the metric 
ni{U) = a(3, which is referred to as the £i-condition number. Note that ki is scale-invariant: if U is 
(a, /3)-conditioned, jU is (07, /3/7)-conditioned, and hence ki{'^U) = 07/3/7 = a/3 = ki(U). This 
saves us from determining the scaling constants when implementing CT, FCTl, and FCT2. While 
computing a = \\U\\i is trivial, computing /3 = l/(min|j2|j^=x llf^-^Hi) is not as easy: it requires 
solving n linear programs: 

mm mm t^-z 1 
j=l,...,d ||z||oo < 1 

= 1 

Note that this essentially limits the size of the test problems in our empirical evaluation: although 
we have applied our algorithms to much larger problems, we must solve these linear programs if 
we want to provide a meaningful comparison by comparing our fast £i-based algorithms with an 
"exact" answer. 

Another factor limiting the size of our test problems is more subtle and is a motivation for 
our comparison with ^2-based algorithms. Consider a basis induced by the Gaussian transform: 
U = A{Q'^GA)~^ , where G S M*^('^)^" is a matrix whose entries are i.i.d. Gaussian. We know that 
i^2{U) = 0(1) with high probability. In such case, we have 

d d 
\\U\\i = Y,\\Uej\\i<Y,^^''^\W^jh<n^''^d-a'^^''{U), 

and 

\\Uz\\i > \\Uz\\2 > af'^mizh > <Tf^U)\\z\\oo. 

Hence ki{U) < n^/^d • af^''{U)/af''{U) = 0{n^/^d). Similar results apply to the FJLTs that 
work on an entire subspace of vectors, e.g., the Subsampled Randomized Hadamard Transform 
(SRHT) ^24J and the Subsampled Randomized Fourier Transform (SRFT) ^14j . In our empirical 
evaluation, we use SRHT of [24J as our implementation of FJLT, but we note that similar running 
times hold for other var iants of the FJLT [3]. Table [l] hsts the running time and worst-case 
performance of each transform on £i-conditioning, clearly showing the cost-performance trade- 
offs. For example, comparing the condition number of GT or FJLT, 0{n^^'^d), with the condition 
number of CT, 0{d^-^ log^'^ d), we will need n > 0{d^ log^ d) to see the advantage of CT over 
^2-based algorithms (e.g., n should be at least at the scale of 10*^ when d is 100). To observe the 
advantage of FCTl and FCT2 over ^2-based transforms, n should be relatively even larger. 

Motivated by these observations, we create two sets of test problems. The first set contains 
matrices of size 2^^ x 4 and the second set contains matrices of size 2^^ x 16. We choose the number 
of rows to be powers of 2 to implement FCT2 and FJLT in a straightforward way. Based on our 
theoretical analysis, we expect ^i-based algorithms should work better on the first test set than 
^2-based algorithms, at least on some worst-case test problems; and that this advantage should 
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Ai 


A2 


A3 


A4 




1.93e+04 


7.67e+05 


8.58 


112 


CT 
FCTl 
FCT2 


[10.8, 39.1] 
[9.36, 21.2] 
[12.3, 32.1] 


[10.4, 41.7] 
[15.4, 58.6] 
[17.3, 76.1] 


[10.2, 33] 
[10.9, 38.9] 
[10.9, 43] 


[8.89, 42.8] 
[11.3, 40.8] 
[11.3, 42.1] 


GT 
FJLT 


[6.1, 8.81] 
[5.45, 6.29] 


[855, 1.47e+03] 
[658, 989] 


[5.89, 8.29] 
[5.52, 6.62] 


[6.9, 9.17] 
[6.18, 7.53] 



Table 2: £i-norm conditioning, ki{U) = a/3, on matrices of size 2^*^ x 4. We compute the first and 
the third quartiles of the £i-norm conditioning nmnber in 50 independent runs for each matrix and 
each algorithm. The size is chosen to demonstrate the difference between £i-based and ^2-based 
conditioning algorithms and the superiority of the ^i-based algorithms in the asymptotic regime. 
GT and FJLT don't work weU on A2, resulting condition numbers close to the worst-case bound 
of n^/'^d = 2048. CT, FCTl, and FCT2 perform consistently across all matrices. 



disappear on the second test set. For each of these two sizes, we generate four test matrices: 
Ai is randomly generated ill-conditioned matrix with slightly heterogeneous leverage scores; A2 is 
randomly generated ill-conditioned matrix with strongly heterogeneous leverage scores; and A3 and 
A4 are two "real" matrices chosen to illustrate the performance of our algorithms on real-world 
data. In more detail, the test matrices are as follows: 

• Ai = D1G1D2G2, where Di G M"^" is a diagonal matrix whose diagonals are linearly spaced 
between 1 and 10^, Gi G W^^'^ is a Gaussian matrix, D2 G M'^^'^ is a diagonal matrix whose 
diagonals are linearly spaced between 1 and lO'^, and G2 G W^^"^ is a Gaussian matrix. 

• ^2 = . '^lisre G G M is a Gaussian matrix. 

• ^3, the leading submatrix of the SNP matrix used by Paschou et al. [20]. The SNP matrix is 
of size 492516 x 2250, from the Human Genome Diversity Panel (HGDP) and the HapMap 
Phase 3 dataset. See ^20j for more descriptions of the data. 

• A4, the leading submatrix of the Tinylmages matrix created by Torralba et al. [22] . The 
original images are in RGB. We convert them to grayscale intensity images, resulting a matrix 
of size 8e7 x 1024. 

To implement FCTl & FCT2 for our empirical evaluations, we have to fix several parameters in 
Theorems [9] and [l| finding a compromise between theory and practice. We choose ri = \2dlogc[\ 
except n = 2d for GT. We choose s = [2d log d] and t = 2d^ for FCTl, and s = 2r2iog2(2rfiogd)l foj. 
FCT2. Although those settings don't follow Theorems [9] and [T] very closely, they seem to be good 
for practical use. Since all the transforms are randomized algorithms that may fail with certain 
probabilities, for each test matrix and each transform, we take 50 independent runs and show the 
first and the third quartiles of ki in Tables [2] and [3} 

The empirical results, described in detail in Tables [2] and [3| conform with our expectations. 
The specifically designed £1 -based algorithms perform consistently across all test matrices, while 
the performance of ^2-based algorithms is quite problem-dependent. Interestingly, though, the 
^2-based methods often perform reasonably well: at root, the reason is that for many input the £1- 
leverage scores are not too much different than the ^i-leverage scores. That being said, the matrix 
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Ai 


A2 


As 


A^ 


Ki{Ai) 


4.21e+05 


2.39e+06 


36.5 


484 


CT 
FCTl 
FCT2 


[90.2, 423] 
[113, 473] 
[134, 585] 


[386, 1.44e+03] 
[198, 1. le+03] 
[237, 866] 


[110, 633] 
[114, 765] 
[106, 429] 


[150, le+03] 
[127, 684] 
[104, 589] 


GT 
FJLT 


[27.4, 31] 
[19.9, 21.2] 


[678, 959] 
[403, 481] 


[28.8, 32.3] 
[21.4, 23.1] 


[29.4, 33.5] 
[21.8, 23.2] 



Table 3: ^i-norm conditioning, ni{U) = af3, on matrices of size 2^^ x 16. We compute the first and 
the third quartiles of the £i-norm conditioning number in 50 independent runs for each matrix and 
each algorithm. The size is chosen to demonstrate that ^2-based conditioning algorithms can be as 
good as or even better than ^i-based conditioning algorithms. GT and FJLT still don't work well 
on A2, but they become comparable to £i-based algorithms. Although still performing consistently 
across all matrices, ^i-based algorithms perform much worse than in the first test set due to the 
increase of d and decrease of n. 



A2 clearly indicates that ^2-based methods can fail for "worst-case" input; while the £i-based 
methods perform well for this input. 

On the first test set, £i-based algorithms are comparable to ^2-based algorithms on Ai, A^, 
and A4 but much better on A2. The differences among ^i-based algorithms are small. In terms of 
conditioning quality, CT leads FCTl and FCT2 by a small amount on average; but when we take 
running times into account, FCTl and FCT2 are clearly more favorable choices in this asymptotic 
regime. On the second test set, £i-based algorithms become worse than ^2-based on Ai, A^, and 
A4 due to the increase of d and the decrease of n. All the algorithms perform similarly on A2; 
but £i-based algorithms, involving Cauchy random variables, have larger variance than ^2-based 
algorithms. 



6.2 Application to £1 regression 

Next, we embed these transforms into fast approximation of ii regression problems to see how 



they affect the accuracy of approximation. We implement the FastCauchyRegression of Section 4.2 
except that we compute the row norms of U exactly instead of estimating them. Although this takes 
0{nd'^) time, it is free from errors introduced by estimating the row norms of U , and thus it permits 
a more direct evaluation of the regression algorithm. Unpublished results indicate that using 
approximations to the £1 leverage scores, as is done at the beginning of the FastCauchyRegression 
algorithm, leads to very similar quality-of-approximation results. 

We generate a matrix A of size 2^^ x 7 and generate the right-hand sides h = j4a; exact + £) where 
f^exact is a Gaussian vector, and e is a random vector whose entries are independently sampled from 
the Laplace distribution and scaled such that ||e||2/||^2;exact||2 = 0.1. Then, for each row i, with 
probability 0.001 we replace hi by 100||e||2 to simulate corruption in measurements. On this kind 
of problems, li regression should give very accurate estimate, while £2 regression won't work well. 
For completeness, we also add uniform sampling (UNIF) and no conditioning (NOCD) into the 
evaluation. Instead of determining the sample size from a given tolerance, we accept the sample 
size as a direct input; and we choose sample sizes from 2^ to 2^^. 

The results are shown in Figure [3], where we draw the first and the third quartiles of the relative 
errors in objective value from 50 independent runs. If the subsampled matrix is rank- deficient, we 
set corresponding relative error to 00 to indicate a failure. We remove relative errors that are larger 
than 100 from the plot in order to show more details. As expected, we can see that UNIF and NOCD 
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Figure 3: The first and the third quartiles of relative errors in objective value. The problem size 
is 2^* X 7. The first quartiles are drawn in solid lines while the third drawn in dashed lines. If 
the subsampled problem is rank-deficient, we set corresponding relative error to oo. If a quartile is 
larger than 100, we remove it from the plot. There are few differences among those algorithms on 
Ai, A3, and A4. UNIF and NOCD are clearly inferior to algorithms that explore both conditioning 
and leverage score-based sampling. UNIF and NOCD also failed on A2 completely. GT and FJLT 
failed on A2 when the sample size is smaller than 512. CT works slightly worse than FCTl and 
FCT2 on these tests. One interesting fact from the result is that we see e ~ 1/s instead of l/s^^^. 
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are certainly not among reliable choices; they failed (either generating rank-deficient siibsampled 
problems or relative errors larger than 100) completely on A2- In addition, GT and FJLT failed 
partially on the same test. Empirically, there is not much difference among £i-based algorithms: 
CT works slightly worse than FCTl and FCT2 on these tests, which certainly makes FCTl and 
FCT2 more favorable. (One interesting observation is that we find that, in these tests at least, 
the relative error is proportional to 1/s instead of 1/s^/^. At this time, we don't have theory to 
support this observation.) This coupled with the fact that £i-leverage scores can be approximated 
more quickly with FCTl and FCT2 suggests the use of these transforms in larger-scale applications 
of £i regression. 

6.3 Evaluation on a l£trge-scale £i-regression problem 

Here, we continue to demonstrate the capability of sampling-based algorithms in large-scale applica- 
tions by solving a large-scale ii regression problem with imbalanced and corrupted measurements. 
The problem is of size 5.24e9 x 15, generated in the following way: 

1. The true signal x* is a standard Gaussian vector. 

2. Each row of the design matrix A is a canonical vector, which means that we only estimate 
a single entry of x* in each measurement. The number of measurements on the i-th entry 
of X* is twice as large as that on the (i -I- l)-th entry, i = 1, . . . , 14. We have 2.62 billion 
measurements on the first entry while only 0.16 million measurements on the last. Imbalanced 
measurements apparently create difficulties for sampling-based algorithms. 

3. The response vector is given by 



where aj is the i-th. row of A and {ej} are i.i.d. samples drawn from the Laplace distribution. 
0.1% measurements are corrupted to simulate noisy real- world data. Due to these corrupted 
measurements, £2 regression won't give us accurate estimate, and £1 regression is certainly a 
more robust alternative. 

Since the problem is separable, we know that an optimal solution is simply given by the median of 
responses corresponding to each entry. 

The experiments were performed on a Hadoop cluster with 40 cores. Similar to our previous test, 
we implement and compare Cauchy-conditioned sampling (CT), Gaussian-conditioned sampling 
(GT), un-conditioned sampling (NOCD), and uniform samphng (UNIF). Since A only has 2n 
non-zeros, CT takes 0{nd\ogd) time instead of O {n(P log d), which makes it the fastest among 
CT, FCTl, and FCT2 on this particular problem. Moreover, even if A is dense, data at this 
scale are usually stored on secondary storage, and thus time spent on scanning the data typically 
dominates the overall running time. Therefore, we only implemented CT for this test. Note that 
the purpose of this test is not to compare CT, FCTl, and FCT2 (which we did above), but to reveal 
some inherent differences among £i-conditioned sampling (CT, FCTl, and FCT2), ^2-conditioned 
sampling (GT and FJLT), and other sampling algorithms (NOCD and UNIF). For each algorithm, 
we sample approximately 100000 (0.019%) rows and repeat the sampling 100 times, resulting 100 
approximate solutions. Note that those 100 approximate solutions can be computed simultaneously 
in a single pass. 




afx* + Ci otherwise 



with probability 0.001 



1 = 1,..., 
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\\x - x*\\i/\\x*\\i 


\\x - X*\\2/\\x*\\2 


11^ X \\oo/\\^ ||oo 


CT 


[0.008, 0.0115] 


[0.00895, 0.0146] 


[0.0113, 0.0211] 


GT 


[0.0126, 0.0168] 


[0.0152, 0.0232] 


[0.0184, 0.0366] 


NOCD 


[0.0823, 22.1] 


[0.126, 70.8] 


[0.193, 134] 


UNIF 


[0.0572, 0.0951] 


[0.089, 0.166] 


[0.129, 0.254] 



Table 4: The first and the third quartiles of relative errors in 1-, 2-, and oo-norms. CT clearly 
performs the best. GT follows closely. NOCD generates large errors, while UNIF works but it is 
about a magnitude worse than CT. 




Figure 4: The first (solid) and the third (dashed) quartiles of entry-wise absolute errors. 



We first check the overall performance of these sampling algorithms, measured by relative 
errors in 1-, 2-, and oo-norms. The results are shown in Table |4} Since the algorithms are all 
randomized, we show the first and the third quartiles of the relative errors in 100 independent 
runs. We see that CT clearly performs the best, followed by GT. UNIF works but it is about a 
magnitude worse than CT. NOCD is close to UNIF at the first quartile, but makes very large errors 
at the third. Without conditioning, NOCD is more likely to sample outliers because the response 
from a corrupted measurement is much larger than that from a normal measurement. However, 
those corrupted measurements contain no information about x* , which leads to NOCD's poor 
performance. UNIF treats all the measurements the same, but the measurements are imbalanced. 
Though we sample 100000 measurements, the expected number of measurements on the last entry 
is only 3.05, which downgrades UNIF's overall performance. 

We continue to analyze entry-wise errors. Figure [4] draws the first and the third quartiles of 
entry-wise absolute errors, which clearly reveals the differences among £i-conditioned sampling, 
^2-conditioned sampling, and other sampling algorithms. While UNIF samples uniformly row-wise, 
CT tends to sample uniformly entry- wise. Though not as good as other algorithms on the first entry, 
CT maintains the same error level across all the entries, delivering the best overall performance. The 
^2-based GT sits between CT and UNIF. ^2-conditioning can help detect imbalanced measurements 
to a certain amount and adjust the sampling weights accordingly, but it is still biased. 
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To summarize, we have shown that ^i-conditioned samphng indeed works on large-scale ii 
regression problems and its performance looks promising. We obtained about two accurate digits 
(0.01 relative error) on a problem of size 5.24e9 x 15 by passing the data twice and sampling only 
100000 (0.019%) rows in a judicious manner. 

7 Conclusion 

We have introduced the Fast Cauchy Transform, an £i-based analogue of fast Hadamard-based 
random projections. We have also demonstrated that this fast ^i-based random projection can be 
used to develop algorithms with improved running times for a range of €i -based problems; and we 
have provided the first implementation and empirical evaluation of an £i-based random projection. 
Our empirical evaluation clearly demonstrates that for large and very rectangular problems, for 
which low-presision solutions are acceptable, our implementation follows our theory quite well; 
and it also points to interesting connections between ^i-based projections and ^2-based projections 
in practical settings. Understanding these connections theoretically, as well as using these ideas 
to develop improved algorithms for high-precision solutions to £i-based problems, are important 
problems raised by our work. 
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A Proofs of Technical Cauchy Lemmas 
A.l Proof of Lemma [T] 

The proof uses similar techniques to the bounds due to Indyk (l5j for sums of independent chpped 
half-Cauchy random variables. Fix M > (we will choose M later) and define the events 

Fi = {\Ci\ < M}, 

and F = rii^[m]Fi. Note that F Ci Fi = F. Using the pdf of a Cauchy and because tan"^ x < x, we 
have that: 



2 1 2 1 / 1 \ 

Pr[Fi] = - tan"^ (M) = 1 - - tan"^ t7 > 1 
vr vr \^'^ J 



irM 

By a union bound, Pr[F] > 1 - Further, Pr[F\Fi]Pr[Fi] = Pr[F n F,] = Pr[F], hence 

Pr[F\Fi] = Fr[F]/Pr[Fi]. We now bound E [\Ci\ \ F] . First, observe that 

E \Fi] = E [\Ci\ I n F] Pr[F|F,] + E | n F] Pr[F|Fi] 
> E[|Ci| I F,nF]Pr[F|F,]. 

Next, since Fj n F = F, we have that 

E m \F\ < pr[^|^^] - — — • 

Finally, by using the pdf of a Cauchy, E [|Ci||Fi] = | log(l + M^)/Pr[Fi], and so 

r, , , . ^log(l + M2) ^log(l + M2) 



Pr[F] - l-2m/7rM ' 
We conclude that 

^ ' ^ ^ ' LI »l I J - ^ 1 _ 2m/7rM 

By Markov's inequality and because Pr[X > 7t|F] < 1, we have: 

Pv[X > 7t] = Py[X > 7t|F]Pr[F] + Pr[X > jt\F]{l - Pr[F]) 
^ 2 log(l + M^) ^ 2m 



TTt 1 - 2m /-kM ttM 
The result follows by setting M = mt. 

A. 2 Proof of Lemma [2] 

To bound the lower tail, we will use the following Bernstein-type bound which is from [19]. 

Lemma 10 (pjj). Let Xi > be independent random variables with '^i'E[Xf] < oo, and define 
X = ^-Xi. For any t > 0, 

P,-[X<Em-,l<exp(^^^), 
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By homogeneity, it suffices to prove tlie result for 7=1. Let Zi = 7j min(|Cj|, M). Clearly 
Zi < TilCil and so defining Z = Zj, we have that Z < X and Pr[X < 1 — t] < Pr[Z < 1 — t]. 
Thus, we have that 

Pr[Z <l-t] = Pr[Z < E[Z] - (E[Z] - 1 + 1)] < exp (z(|g_Lt^ 



where the last step holds by Lemma 10 for 1 — t < E[Z]. Using the distribution of the half-Cauchy, 



one can verify using standard techniques that by choosing M 1.6768, E[Zj] = 7j and P^iZf] < ^jf, 
so EiH^i] = 1 and E»E[Z2] < | ^.72 < ^. It follows that Pr[Z < 1 - t] < exp (-tV^) , 
and the result follows. 

A. 3 Proof of Lemma |3] 

First, observe that ||£>Z2:||^ = Dii\Z^i)z\, and since E[Aj] = 1, E[||Z)Zz||J = X^jg[„] |%)^| = 

||Zz||j^. Next, observe that 

Dii\Z^i)z\ - \Z^:)Z\ = Y Dii\Z^)z\ - ^ |%^;|, 

ie[n] ie[n] Pi<l Pi<l 

because when pi = 1, that row must be sampled, and so does not contribute to the deviation. So, 
we only need to analyze the RHS of the above equation. From now on, we only consider those i 
with Pi < 1, in which case pi = s ■ \/Ylii!^[n\ = ^ ' ^i-, where ti > a||Z(j) Let Qi be the 

(positive) random variable Dii\Z(j^z\; either = or 

^ ^ ll%llill^lloo ^ ll%llilklloo ^ _^||^|| II II ^ 7 

Pi Pi sti as ^ °" s 

where we defined 7 = ^||Z||J|z||^. We can also obtain a bound for Ylpi<i Var[Qj]: 

\Z 

Y Var[Q,] = Y Var[Q,] < ^ E[Q2] = ^ = ^ g^j^^^^^l < l\\Zz\\„ 

Pi<i pi<i pi<i Pi<i ^* pi<i 

where, in the last inequality, we used the upper bound for Qi and we further upper bounded by 
summing over all i £ [n]. Let Q = J2i Qi with Qi < "fq] the standard Bernstein bound states that 



Pr[|Q-E[Q]| > e] < 2 exp 



-e2 



2E.Var[Q,] + |e7Q^ 
Plugging in our bounds for Var[Qj] and 7q, we deduce that 



Pr 



\DZz\\-^^ — ||Zz||^| > e\\Zz\\i 



< 2 exp 



-eHZzWl 



27 II 7»|| _i_ 2e7 II 7 II 



3s 

The lemma follows after some simple algebraic manipulations. 
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B Construction II (via a Fast Johnson-Lindenstrauss Transform) 



Let r 1 = c-d log | , s = c' • (d + log j ) , and t = s^~^^ , where the parameters c, c' > are appropriately 
large constants. We construct Hi G 



nri Xn 



Hi 



8 /vrt 



n V 2s 



• CH, 



where 



C G i^''iX"«/* is a matrix of independent Cauchy random variables; and 

H G ^s/ty.n -g ^ block-diagonal matrix comprising n/t blocks along the diagonal. Each block is 
the s X t Fast Johnson-Lindenstrauss matrix G that satisfies Lemmas 11 and |9] below. Here, 
for simplicity, we assume that n/t is an integer. 



'G 



H = 



G 



G 



Informally, the matrix H reduces the dimensionality of the input space by a very small amount 
such that the "slow" Cauchy Transform C of [21] can be applied in the allotted time. Then, since 
we are ultimately multiplying by C, the results of [21] still hold; but since the dimensionality is 
reduced, the running time is improved. For this version of the FCT, we can prove the following 
result. 

Theorem 9 (Fast Cauchy Transform (FCT II)). Suppose logn < d < n. For given 5 > Q, there 
is ri = 0{d\og^), K = 0(y (d + log j)"'^^'' log d), and a distribution over matrices Hi G M*"!^" such 
that for arbitrary (but fixed) A G M"^'^, with probability 1 — 5 it holds that for all x G , 



\\Ax\\i < l|ni^x||i < k\\Ax\ 



1- 



For any y G M", the product Tiiy can be computed in 0(n log ^) time. 

For 5 a small constant, ri = 0{d\ogd), k = 0{d'^^'^ logd) and UiA can be computed in 0{ndlogd) 
time. Thus, we have a fast linear oblivious mapping from from i— )• that has distortion 

Oi^d"^^^ log d) on any (fixed) d-dimensional subspace of M". 

Comparing with Theorem [l| Theorem [9] is a slightly worse result because ij > 0, and we require 
d > logn. This second construction has the benefit of being easily extended to constructing well 
conditioned bases for p > 1 (see Appendix [s]) . 



Remark. The requirement t > 5^+^ is set by the restriction in Lemma 11 In the bound of 
Theorem [oj k = where k' = 0{d\og{rid)) arises from Theorem 10 that appeared in [21]. If a 



stronger version of Lemma 1 1 can be proved that relaxes the restriction t > s'^^'^ , then correspond- 



ingly this bound will improve. 



31 



B.l Proof of Theorem [9] 



Preliminaries. We will need results from prior work, which we paraphrase in our notation. 

Definition 3 (Definition 2.1 of ^). For e G (0, a distribution on s x t real matrices G {s < t) 
has the Manhattan Johnson-Lindenstrauss property (mJLP) if for any (fixed) vector x G M*, the 
inequalities 

(l-e)||x||2<||Gx||2<(l + e)||x||2 
c^y/s{l - e)||x||2 < < c^y/s{l + e)||x||2 



holds with probability at least 1 — cie ^'^^^^ (w.r.t. G), for global constants ci,C2,C3 > 0. 

Remark. This is the standard Johnson-Lindenstrauss property with the additional requirement 
on Essentially it says that Gx is a nearly uniform, so that HGxH^ ~ -^5110x112 ~ -^||x||2. 

Lemma 11 (Theorem 2.2 of [2j). Let ij > be an arbitrarily small constant. For any s,t satisfying 
s < t^^'^~^ , there exists an algorithm that constructs a random s xt matrix G that is sampled from 

an mJLP distribution with C3 = \ -■ Further, the time to compute Gx for any 2; G M* is O(tlogs). 



We also need a result on how the matrix of Cauchy random variables G behaves when it hits 
a vector y. The next theorem is Theorem 5 of [21j. For completeness and also to fix some minor 



errors in the proof of [21], we give a proof of Theorem 10 in Appendix D 



Theorem 10. Let L be an arbitrary (fixed) subspace o/M" having dimension at most d, and C an 
ri X n matrix of i.i.d. Cauchy random variables with ri = c ■ dlog j for large enough constant c. 
Then, with probability at least 1 — 5, and for all y G L, 

\\y\\i<^JGy\\i<K'\\yh, 



where k' = 0{^\og{rid)) . 

Note that for 6 fixed to some small error probability, ri = 0{d\ogd), and the product Gy in the 
theorem above can be computed in time 0{rin) = 0{nd\ogd). 

Main Proof. We now proceed with the proof of Theorem [9j We need to analyze the product 
GHAx for aU x G W'-. Let y = Ax & M", so that y G colsp^ = {Az \ z G M'^}, and the column 
space colsp A is a d-dimensional subspace of M". Partition the coordinate set [n] into n/t contiguous 
groups of t coordinates. We will work with the block representation of y, as defined by this partition, 
i.e., with y'^ = [zf , zj, ■ ■ ■ , -z^/Ji where Zi = A(^^ij-jx and where ^({j}) is the block of t rows in A 
corresponding to the indices in Zj. Then, 



Hy 



Gzi 
Gz2 



The vector Zi G colsp noting that colsp is a subspace of M* of dimension at most d. Let 

Ui G M*^*^ be an orthonormal basis for colsp and let Zi = UiWi. Setting e = ^ in Lemma|9| 

and recalling that G is s x i, fc in Lemmajojcan be expressed as A; = — ^ — log(2c2). Applying 
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a union bound, we have that for all i E [n/t] with probability at least 1 — 2ci • j ■ exp(— j|| + ^) 
that for all y = Ax (and corresponding Zi G M*), it holds that 



2 1 1 -^^ 1 1 2 — II 1 1 2 — y 2 1 1 1 1 2 
|c3^||zi||2 < \\Gzi\\^ < |c3i/s||zj||2 . 

We will condition on this event, which occurs with high probability for the given parameters. We 
can now bound ||i^?/||x = ^i^[n/t] ll^-^dli ^ follows. 

Il^ylli = Yl \\Gzi\\i < ^cs^/s \\zi\\2 < IcsVs llzilli = |c3Vs||2/|li; (10) 

i£[n/t] je[ra/t] ieln/t] 

\\Hyh = Yl \\Gzi\\i>lc3^ Y \\zih>lc3^/l Y 11^^111 = 5^3^111^111; (11) 

ie[n/t] ie[n/t] ie[n/t] 



Since colsp HA has dimension at most d, we can apply Theorem 10 to it. We have that with 
probability at least 1 — 6, for all x G M'^, 

llFAxlli < —\\CHAx\\i < k\\HAx\\i, 
ri 



where k' = 0{4log{rid)) from Theorem 



10 



Recall that Hi = f^yJ^^CH. We now use (10) and 



(11) with y = Ax, and after multiplying by ^\/^ and setting C3 = ^J2/t:, we conclude that for all 



X G 

\\Ax\\i < ||niylx||i < 3K'\/t||Ax||i (12) 

holds with probability at least 1 - 5 - 2ci • f exp(-g| + ^) > 1 - 25 (by choosing s > ^(^ + 

log ^^)). The theorem follows because logn < d and hence n' = 0{j \ogd), s = 0{d + log and 
t = 0(52+"). 

Running Time. We now evaluate the time to compute Iliy for y G M". We first compute Hy 
which requires n/t computations of Gzi. Since s = ti/2-»?/2^ -^g invoke Lemma 11 The time to 
compute all Gzi is j ■ tlogs = nlogs. Since Hy is (ns/t) x 1, it takes 0{rins/t) time to compute 
CHy, which concludes the computation. The total running time is 0(nlogs + nris/t). Using 
logn < d, s = 0{d), t = s^^^, n = 0{dlog |) we need total time 0(nlog |). To compute IIi^, we 
need to compute HiA^^^ for d vectors A^^\ resulting in a total run time 0(ndlog ^). 

C Proof of Lemma [9] 

We will need some lemmas from prior work. The first two lemmas are on properties of a 7-net, 
taken directly from Lemma 4 of j3]. Let U G M*^*^ be a matrix whose columns are an orthonormal 
basis for L; let S be the unit sphere in and let T be the set of points in Sl defined by 

T= !^w:w e -^^'^^ WMh < 1 

where Z"^ is the d-dimensional integer lattice on L. The set T is a 7-net on Sl because every point 
in Sl is at most ^2-distance 7 from some point in T. 
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Lemma 12 (Lemma 4 of [3]). |r| < e""^ for c = + 2). 

Lemma 13 (Lemma 4 of [3j). For any d x d matrix M , if for every u,v ^ T we have lu^Mv] < e, 
then for every unit vector w, we have \vo^ Mw\ < (^il^-ji ■ 

The next lemma demonstrates that a JLP distribution preserves matrix products. 

Lemma 14 (Theorem 19 of [E]). For e G (0, let G he an s x t matrix he drawn from an 
mJLP distribution as given in Definition Then for A, B any real matrices with t rows and 
\\A\\f = \\B\\f = I, 

PrciWA^G^GB - A^B\\f > 3e/2] < cie-'''''\ 

We now prove the first part of Lemma 9} Let M be the d x d matrix M = U'^GG'^U — I, and 
let T be the 7-net with 7 = i. By Lemma 



12 



T\ < e . Let u,v £ T he any two points in T, and 
set A = Uu, B = Uv to be two matrices (actually vectors) with t rows. Since U has orthonormal 
columns, = = 1. By Lemma 14, after relabeling 3e/2 — )• e, 

PrGilu'^U^G'^GUv - u^v\ > e] < cie'^^^^^'/^ 

So, applying the union bound, for every pair x,y £ T, 

\x^U^GG^y - x^y\ = \x'^My\ < e 

holds with probability at least 1 — ci\T\'^e~'^^"^^^ . Let G be the s x t mJLP matrix constructed as 
per Lemma 11 We will now derive a bound on s for the first result (2-norm) to hold. For every 



unit-norm x in L, x = Uw for unit norm vj G M"^. By Lemma 13 (with 7 = i), for every unit vector 



w G Sl, 

\uFu^GG^Uw - \\w\\l\ < 4e. 
bmce w U^GG^Uw = \\Gx\\l and || 

^^112 ~ 11^112' 8'fter rescaling 4e — ?■ e, we have proved that with 

probability at least 1 — cie^'^e~^^^'^ 

\/l — e||a;||2 < ||Gx||2 < \/l + e||a;||2- 

We now derive the second result (manhattan norm), conditioning on the high probability event 
that the result holds for the 2-norm as proved above. Since G is an mJLP, we also have that with 
probability at least 1 — ci\T\e~'^^^'' , for every w £ T with x = Uw, 

C3\/^(l - e)||x||2 < \\Gx\\^ < C3y/7s{l + e)\\x\\2. (13) 

Now consider any unit 2-norm x £ L; x = U{w + A), where w £ T has unit 2-norm, w + A has 
unit 2-norm, and ||A||2 < 7 because T is a 7-net on S. Then, 

\\Gx\\^ = \\GUw + GUA\\-^ = \\GUw\\^ + A', 



where |A'| < ||GC/A||^. We can bound the first term on the RHS using (13). To bound the second 
term, use the 2-norm bound as follows: 



\\GUA\\^ < Vs||GC/A||2 < v/s(l + e)l|f^A||2 = ^^(1 + e)|| AII2 < \/2^7, 
(the last inequality is because e < 1). Thus, for every unit norm x £ L, 

C3^/s{l - e) - 27^5 < \\Gx\\^ < C3 Vs(l + e) + 27^8. 



34 



Choosing 7 = \T\ = exp \ 2d{l + ^ • Since C3 < 1 and £ < \, with probabiUty at least 

csVsil - 2e) < ||Gx||i < C3^/s(l + 2e). 

After rescahng 2e — t- e, the probabiUty becomes at least 1 — cie®'^/'^^^e~'^^'"^^/4 Taking a union bound 
over the 2-norm result and the manhattan norm result, and using 8d < Sd/c^e, we finally have that 
for any unit 2-norm x, both the inequalities 

(1-e) < ||Ga;||2 < (1 + e) 
C3Vs{l - e) < \\Gx\\^ < C3^/s{l + e) 

hold with probability at least 1 — 2cie*'^/'^3^e~'^2«<:2/36 _ -j^ _ g-fc^ where the last equality follows by 
setting s = 36{k + + log(2ci))/c2e^ = 0{-^ + ^). Since the result holds for any unit norm x, 
it holds for any x by scaling by ||x||2- 



D Proof of Theorem 10 



Upper Bound. First we prove the upper bound. Let U he a ii {d, l)-conditioned basis for L (see 
Section 4.1). Therefore, \\U\\-^ < d, < \\Ux\\^ for all x S M'^, and for any y ^ L, y = Ux for 

some X. Let y £ L; we have, 

\\Ry\\ = \\RUx\\i < \\RU\\i\\x\\oo < = ||i??7||i||?/||i 

Thus, it suffices to prove an upper bound on ||i??7||i. {RU)ij = J2k ^ik^kj is a Cauchy scaled 
by = ||C/(-'^||. So is a sum of rid scaled, dependent half-Cauchys with sum of scalings 

7 = IIC'^'ll = r,||C/|li. By Lemmalll 



Pr[||W||.>tnl|i/|l.]< '"'°'^'''';) + '°'^" (i + 0(i)). 

It suffices to set t = 0(| log{rid)) for the RHS to be at least 1 — 6. Since \\U\\i < d, with probability 
at least 1-5, \\RU\\^ = log(rid)). Multiplying both sides by C = 4/ri gives the upper bound. 

Lower Bound. The lower bound is essentially following the proof of the lower bound in Theorem 
5 of j21j . and so we only provide a sparse sketch of the proof. Consider an arbitrary, fixed y. The 
product CRy is distributed as a Cauchy random vector whose components are independent and 
scaled by C||y||;^. Therefore 

n 



\\CRy\\, = C\\y\\,^\Xi\, 



where Xi are i.i.d. Cauchy random variables. We now apply Lemma [2] with 7 = riC||y||^, = ri 
and setting t = ^, to obtain 



Pr 



\\CRy\\, < -riC\\y\\, 



< exp(-ri/12) . 



Since C = 4/ri, we have Pr [||Ci?y||-^ < 2||y|y < exp (— ri/12) . The result now follows by putting 
a 7- net T on L for sufficiently small 7. This argument follows the same line as the end of Section 

3 of m\. 
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It suffices to show the result for \\y\\i = 1. Consider the 7- net on L with cubes of side 7/d. There 
are {2d/'^Y such cubes required to cover the hyper-cube \\y\\^ < 1; and, for any two points 2/1,2/2 
inside the same 7/d-cube, ||yi — y2||i < 7- From each of the 7/d-cubes, select a fixed representative 
point which we will generically refer to as y*] select the representative to have ||y*||i = 1 if possible. 
By a union bound 

< (2d/7)'^exp(-ri/12). 

We will thus condition on the high probability event that ||Ci?y*||^ > \\y*\\i for all y* . We will 
also condition on the upper bound holding (which is true with probability at least 1 — 5). For any 
y €z L with \\y\\i = 1, let y* denote the representative point for the cube in which y resides (by 
construction, \\y*\\i = 1 as well). Then \\y — y*\\ < 7 and y — y* & L since y,y* £ L and L is a 
subspace. We have 

\\CRy\\, = \\CRy* + CR{y - y*)\\^ > \\CRy*\\, - \\CR{y - y*)h > VWi " " ^lli, 

where we used the upper bound in the last inequality k = ^ ■ 0{dlog{rid/p)). By choosing 7 = 
1/k and recalling that \\y*\\i = 1, we have that ||Ci2?/||]^ > 1, with probability at least 1 — p — 
exp(— ri/12 + dlog{2dK)). Recall that k = 0(| log(ri(i)), so, for c large enough, by picking ri = 

c ■ dlog y, we satisfy ^ > log ^ + d\og{2^ log(rid)), and so our bounds hold with probability at 
least 1 — 26. 



Pr 



min||Ci?y*||i/||y*||i <2 
y 



E Improving the efficiency of our £i-regression algorithm 

Here, we present an algorithm that improves the efficiency of our ii regression algorithm from 
Section |4.2[ and we state and prove an associated quality-of-approximation theorem. 



E.l Description of the algorithm and theorem 

We refer to the new algorithm as OptimizedFastCauchy Regression, and it is summarized in Figure [5] 
The algorithm will need a somewhat larger sample size S, but our main theorem will replace the 
ndlogn term in Theorem [s] with ndlog{de''^ logn). 

The intuition behind the optimized algorithm is as follows. The {i,j)-th entry {UIl2)ij will 

1 1 2 

be a 0-mean Gaussian with variance Since the row has d-dimensions, the -^2-norm and 

^i-norm only differ by Vd. Hence, at the expense of some factors of d in s, we can use sampling 
probabilities based on the ^2-norms. We will show that our estimates of these ^2-iiorms based off 
the down sampling with n2 are within poly(de~^ logn) of the true norms to finally show that the 
result of the £i-regression obtained by solving the sampled problem is (1 +e)-optimal. The difficulty 
we encounter is that some sampling probabilities may shrink by a larger factor. However, one can 
argue that with large enough probability, no row is sampled by the algorithm if its probability 
shrinks by a large factor. Therefore, the behavior of the algorithm is as if all sampling probabilities 
change by at most a poly ((ie~^ Inn) factor, and the result will follow. Here is our main theorem 
for this algorithm; it's proof is presented in the next subsection. 

Theorem 11. Given are e G (0,1), p > 0, A e M"^*^ and b G M". FastCauchyRegression(^, 6) 

constructs a coreset specified by the diagonal sampling matrix D and a solution vector x that 
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Optimized FastCauchy Regression (A, b): 

1: Let X = [A -b] e and construct Hi, an ri x n matrix satisfying ^ 

with A replaced by X. 
2: Compute X' = IliX e M,r^x{d+k)^ ^^-^^ j^-g QR-factorization, UiX = QR together 

with . (Note that is such that liiXR^^ has orthonormal columns.) 
3: Let n2 G k('^+'^)>^'"2 be a matrix of independent Gaussians, with r2 =?. 
4: Let U ^ XR-^ and construct A = [/Ha G M"^''^ 
5: For i G [n], compute = mcdianjgr2 l-^^ul 
6: For i G [n] and s =?, compute probabilities 



Pi = min < 1 , s • 



A,; 



ie[n] 



7: Let D e M"^" be diagonal with independent entries: Da — \'p^ ^ 

I prob. 1 — Pi. 

8: Return 2; G that minimizes ||-DAx — Db\\-^ w.r.t. x (using linear programming) 



Figure 5: Our optimized ^i-regression algorithm 



minimizes the weighted regression objective \\D{Ax — b)\\^. The solution x satisfies, with probability 
at least 1 — -g^, 

\\Ax - b\\^ < (^^^^ \\Ax - b\\^, VxGM'^. 
Further, with probability 1 — o(l), the entire algorithm to construct x, runs in time 

O {nd log(de~"'^ log n) + poly((i, \og{de~'^ In n))) . 

E.2 Proof of Theorem [n] 

We condition on UAR^^ being a well-conditioned basis and b' being a poly((i)-approximation, 
which occurs with probability at least 99/100. Let b" = rjb' , where rj = We fix the 

randomness for Hi in the remainder of the proof. Hence, we can define the vector p S [0,1]" as 
follows: 

. A \\iAR-%)\\i + \b':\ \ 

p, = mml l,s 1 , 

where s = poly(de^^ Inn) is sufficiently large so that FastCauchy Regression succeeds with probability 
1 — 0(1) given this parameter of s. 

For two vectors x,y & (0, 1]", define their ratio distance 

f ^ f f\xi\ \yi\ 

T[x, y) = max max — r, - — r 
From the analysis of FastCauchyRegression we have the following: 

Lemma 15. Suppose p' G [0,1]" is such that T{p,p') < poly (de^^ Inn). Then there is an s' = 
poly (de""^ Inn) so that if we sample the n rows of A o b independently with the i-th row being 
sampled with probability min(l,s'p^), and solve the regression problem on the sampled rows, then 
with probability 1 — o(l), we obtain a (1 -\- e)- approximation. 
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Proof. Let A' = AR~^. For any fixed x e R'^, we have that E[||i:>(^x - 6)||i] = \\Ax - b\\i and we 
have that 

ZjW'^ii)^ - °i 111 ^ J— 7ll^(i)^ ~ 111 - ~i ' 

for all i € [n]. Hence, as in the proof of Theorem ??, by rescaling and a Chernoff bound, we have 
that for any value t = poly((ie~^ Inn), if we choose s' = poly(de^^ Inn) large enough, then we have 
\\D{Ax - h)\\i = (1 ± £)\\Ax - b\\i with probability at least 1 - {d/e)~®^'^\ and so this holds holds 
for all X G M'^ with ||x||oo < i by a union bound. As in the analysis of Theorem ??, this impies the 
solution to the problem on the sampled rows is a (1 + e)-approximation. □ 

We call a set of sampled rows for which the solution is a (1 + e)-approximation a good sample. 

Consider the vector p" S [0, 1]" used to independently sample rows in an instance of Optimized- 
FastCauchy Regression, which is a random variable determined by 112 (recall that the randomness of 
Hi is fixed). Let q" be such that 

ql = 2 ■ median^?^j^|ei^i2n2ej|. 

By 2-stability of a Gaussian and Claim 2 of pLSj, we have that for any C = 0(1) and sufficiently 
large r2 = 0{\og{de~^ Inn)), for any fixed z S [n], 



Pr 



1 



\\{AR-\^h<q'l AR-\^h 



> 1 



{de~^ Inn)'^ 



Let T C [n] be the set of indices i for which 

l-\\{AR-')^,)\\2<q':<3-\\{AR-')^,^\\2. 
Then for any i £ T, and using that 

\\{AR-%)h < ll(^^"')«lli < Vd\\{AR-%^h, 

we have 

0{Vd). 



max 



M Ml 

Ik I \Pi\ 



Consider the vector w G [0, 1]", where Wi = p'l for i £ T, and Wi = pi for i £ [n] \ T. Then, 
t{w,p) = 0{Vd), so by Lemma 15 there is an s' = poly (de~^ Inn) such that if we sample the n 
rows Aob independently according min(l, s'wi) for each i, then with probability at least 99/100, 
the sample is good. 

We now define several events. 

• AIIUpperBounded is the event that for all i, q" = 0{^/logn) ■ || (Ai?^^)(j) ||2. By the probability 
density function of a Gaussian, 

Pr[AIIUpperBounded] > 1 - 1/n^. 

• Let GoodW be the event that the rows sampled according to w are good. 

• Let GoodP" be the event that the rows sampled according to p" are good. 

• Let Bad Row be the event that there is a row i sampled according to p" for which i ^ T. 
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Then, taking the probability over the joint space of 112 and the randomness to choose which rows 
to include in the sample, we have: 

Pr[GoodP"] > Pr[GoodW] - Pr[BadRow]. 

Indeed, this follows from the fact that if the set of samples chosen by sampling according to w is 
good, and Bad Row does not occur, then this means that when solving the regression problem on 
the sample for these rows we will get the same answer, since Wi = p" for any row i eT. 

Pr[GoodP"] > Pr[GoodW] - Pr[BadRow] 

> Pr[GoodW A AIIUpperBounded] - Pr[BadRow] 

> 1 - Pr[^GoodW] - Pr[^AIIUpperBounded] - Pr[BadRow] 

> 1 - o(l) - ^ - PrfBadRow] 

> 1 - o(l) - \ - Pr[^AIIUpperBounded] - Pr[BadRow|AIIUpperBounded] 

> 1 - o(l) - Pr[BadRow| AIIUpperBounded]. 

We now bound Pr[BadRow|AIIUpperBounded]. Fix any 112 for which AIIUpperBounded holds. This 
fixes T and p". Hence, 

Pr[BadRow|AIIUpperBounded] = ^p- = O(V'logn) ■^Wi = O (^/log n) -^pi. 

i^T i^T i^T 

Now we have 

En,Ep. I AIIUpperBounded] < \ -En^Ep,] < \ • E • '^^^c • 



By definition, Y^=\ Pi ^ 

^uAY^Pi I AIIUpperBounded] < (^^-i^^^^^g - 



By setting C > sufficiently large, we have that 



Ena [y^Pi I AIIUpperBounded] = o(l/v^iogn). 

It follows by a Markov bound, 

Pr[ Pr [BadRow | AIIUpperBounded]] = o(l). 

TI2 sampling 

Hence, 

Pr[GoodP"] = 1 - 0(1). 

This completes the proof. 
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